XVERSE-Ent-A4.2B

Model Introduction

The XVERSE-Ent series is a family of domain-specific pretrained models developed by XVERSE (Shenzhen Yuanxiang) for the entertainment domain. These models are trained and optimized on XVERSE’s large-scale, high-quality entertainment-domain data, filling a gap among open-source large language models in the entertainment domain.

XVERSE-Ent includes both Chinese and English models:

  • XVERSE-Ent-A4.2B (Chinese)
  • XVERSE-Ent-A5.7B (English)

Both models adopt a Mixture-of-Experts (MoE) architecture. Detailed technical information is provided below.

| Model | XVERSE-Ent-A4.2B | XVERSE-Ent-A5.7B |
| --- | --- | --- |
| Language | Chinese | English |
| Training Recipe | Multi-stage Training | Fine-grained Upcycling + Multi-stage Training |
| Total Parameters | 25B | 36B |
| Activated Parameters | 4.2B | 5.7B |
| Number of Layers | 28 | 32 |
| Hidden Dimension | 2560 | 3072 |
| Number of Attention Heads | 32 | 32 |
| Number of Shared Experts | 2 | 2 |
| Number of Non-Shared Experts | 64 | 64 |
| Selected Experts per Token | 8 | 8 |
| Vocabulary Size | 100K | 128K |
| Context Length | 8K | 8K |
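To make the expert configuration in the table concrete, the sketch below implements a generic shared-plus-routed MoE feed-forward layer using the XVERSE-Ent-A4.2B sizes (hidden dimension 2560, 2 shared experts, 64 routed experts, top-8 selection per token). The expert inner dimension, module structure, and router details are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Illustrative MoE FFN: shared experts see every token; a router picks
    the top-k of the fine-grained (non-shared) experts for each token."""

    def __init__(self, hidden_dim=2560, expert_dim=1024,
                 num_routed=64, num_shared=2, top_k=8):
        super().__init__()
        self.top_k = top_k

        def make_expert():
            # Small two-layer FFN used as one expert (expert_dim is an assumed size).
            return nn.Sequential(nn.Linear(hidden_dim, expert_dim),
                                 nn.SiLU(),
                                 nn.Linear(expert_dim, hidden_dim))

        self.shared_experts = nn.ModuleList(make_expert() for _ in range(num_shared))
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(num_routed))
        self.router = nn.Linear(hidden_dim, num_routed, bias=False)

    def forward(self, x):  # x: [num_tokens, hidden_dim]
        out = sum(expert(x) for expert in self.shared_experts)   # shared path: every token
        scores = F.softmax(self.router(x), dim=-1)               # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)           # top-8 routed experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize selected weights
        for k in range(self.top_k):
            for expert_id in idx[:, k].unique():
                mask = idx[:, k] == expert_id
                out[mask] += weights[mask, k, None] * self.routed_experts[expert_id](x[mask])
        return out

layer = MoEFeedForward()
print(layer(torch.randn(4, 2560)).shape)  # torch.Size([4, 2560])
```

Each token always passes through the shared experts and only 8 of the 64 routed experts, which is why the per-token compute corresponds to the activated parameter count (4.2B) rather than the total (25B).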

Technical Overview

XVERSE-Ent leverages Sparse Upcycling to convert a dense model into a large-scale MoE model. Combined with a carefully designed multi-stage training strategy, this approach substantially enhances domain-specific capabilities while preserving most general-purpose abilities.

Sparse Upcycling

Sparse Upcycling is a technique that transforms a pretrained dense model into a MoE model without training from scratch. This approach enables a significant increase in total model capacity while substantially reducing training cost and time.

The upcycling process consists of two main steps:

  1. Fine-grained FFN Decomposition
    The Feed-Forward Network (FFN) layers of the dense model are decomposed into multiple smaller sub-networks. Each sub-network is treated as an independent expert in the MoE model. To better accommodate inference-time GPU memory constraints, expert sub-networks can be replicated as needed, enabling flexible adaptation to different hardware configurations.

  2. Attention Reuse
    The attention layers of the original dense model are preserved and directly reused in the MoE model. This design choice maximizes the retention of the original model’s general-purpose capabilities and ensures training stability during architectural transformation.

To illustrate fine-grained FFN decomposition, consider a single FFN split into two sub-networks, each serving as a separate expert.
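A minimal PyTorch sketch of the two steps, under simplifying assumptions: the dense FFN is treated as a plain bias-free up/down projection (ignoring any gate projection), its intermediate dimension is sliced into equal chunks to form the experts (here the two-way split described above), and the attention module is copied unchanged. Attribute names such as `ffn_up`, `ffn_down`, and `attention` are hypothetical, not the names used in the released checkpoints.

```python
import torch.nn as nn

def split_ffn_into_experts(ffn_up: nn.Linear, ffn_down: nn.Linear, num_splits: int = 2):
    """Fine-grained decomposition: slice one dense FFN into `num_splits` sub-FFNs.
    Each sub-FFN keeps a contiguous chunk of the intermediate dimension and becomes
    one expert. For a bias-free FFN, summing all sub-FFN outputs reproduces the dense
    FFN exactly, because the activation is applied element-wise."""
    hidden, inter = ffn_up.in_features, ffn_up.out_features
    chunk = inter // num_splits
    experts = nn.ModuleList()
    for i in range(num_splits):
        up = nn.Linear(hidden, chunk, bias=False)
        down = nn.Linear(chunk, hidden, bias=False)
        # Rows of the up-projection and columns of the down-projection are sliced together.
        up.weight.data.copy_(ffn_up.weight.data[i * chunk:(i + 1) * chunk, :])
        down.weight.data.copy_(ffn_down.weight.data[:, i * chunk:(i + 1) * chunk])
        experts.append(nn.Sequential(up, nn.SiLU(), down))
    return experts

def upcycle_block(dense_block, num_splits: int = 2):
    """Attention reuse + FFN decomposition for one transformer block (hypothetical layout)."""
    return nn.ModuleDict({
        "attention": dense_block.attention,  # reused as-is, preserving general capabilities
        "experts": split_ffn_into_experts(dense_block.ffn_up, dense_block.ffn_down, num_splits),
    })
```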

In contrast, when the FFN is not decomposed and the entire FFN is treated as a single expert, we refer to the approach as coarse-grained decomposition.

In our experiments and practical deployments, the fine-grained decomposition strategy consistently yields better overall performance. Moreover, it provides greater flexibility in configuring expert size, expert count, and memory usage, enabling the model architecture to adapt more effectively to diverse hardware environments and deployment scenarios.
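To make the memory and flexibility trade-off concrete, here is a small back-of-the-envelope comparison; the FFN dimensions, expert counts, and split factor are placeholder numbers, not XVERSE's actual configuration.

```python
# Placeholder dimensions for illustration only.
hidden_dim, intermediate_dim = 2560, 6912
dense_ffn_params = 2 * hidden_dim * intermediate_dim   # up + down projections

# Coarse-grained: each expert is a full copy of the dense FFN.
coarse_experts = 8
coarse_total = coarse_experts * dense_ffn_params

# Fine-grained: the FFN is first split 8 ways, so 64 small experts fit in the same
# parameter budget as 8 full-size ones, while the router makes finer-grained choices.
split_factor, fine_experts = 8, 64
fine_total = fine_experts * (dense_ffn_params // split_factor)

print(coarse_total == fine_total)   # True: same total expert parameters
```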

Multi-Stage Training Strategy

XVERSE-Ent adopts a three-stage training pipeline to build models optimized for specific languages and domains:

  • S0: Capability Reconstruction – recovering general-purpose capabilities after architectural transformation
  • S1: Language Enhancement – strengthening the model’s ability to model the target language
  • S2: Domain Enhancement – enhancing the model’s generation and understanding abilities in the entertainment domain

The first two stages use general-domain data, while the final stage uses a mixture of general-domain and entertainment-domain data. This multi-stage design maximizes retention of general capabilities while significantly improving domain-specific performance.
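For concreteness, the pipeline can be pictured as a staged schedule like the sketch below; the stage names mirror S0-S2 above, but the data-mixture ratios and other fields are placeholders rather than the values actually used to train XVERSE-Ent.

```python
# Hypothetical staged-training schedule; all numbers are placeholders.
STAGES = [
    {   # S0: Capability Reconstruction - recover general ability after upcycling
        "name": "S0_capability_reconstruction",
        "data_mixture": {"general": 1.0},
    },
    {   # S1: Language Enhancement - strengthen modeling of the target language
        "name": "S1_language_enhancement",
        "data_mixture": {"general_target_language": 1.0},
    },
    {   # S2: Domain Enhancement - mix general and entertainment-domain data
        "name": "S2_domain_enhancement",
        "data_mixture": {"general": 0.5, "entertainment": 0.5},   # illustrative ratio
    },
]

for stage in STAGES:
    print(stage["name"], "->", stage["data_mixture"])
```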

XVERSE-Ent-A4.2B (Chinese) is obtained by applying S2 domain-enhancement training on the general-domain backbone XVERSE-MoE-A4.2B.
XVERSE-Ent-A5.7B (English) is built by transforming a general dense backbone into a MoE model via fine-grained MoE upcycling, followed by the full multi-stage training pipeline. Both the Chinese and English models support an 8K context window and are trained on ~1T tokens.

Model Evaluation

To evaluate domain-specific performance, we constructed multiple evaluation datasets across different domains:

  • fiction: novel and story-oriented texts
  • conversation: dialogue-oriented texts
  • webcc: general web text

The evaluation metric is Perplexity (PPL), where lower values indicate better performance.
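As a reference for how such numbers are produced, perplexity is the exponential of the model's average token-level cross-entropy on a text. Below is a hedged sketch using the same loading code as the Usage section; the evaluation files themselves and the chunking of long documents are omitted.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-Ent-A4.2B")
model = AutoModelForCausalLM.from_pretrained(
    "xverse/XVERSE-Ent-A4.2B", trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto").eval()

def perplexity(text: str, max_length: int = 8192) -> float:
    """PPL = exp(mean token-level cross-entropy) of the model on `text`."""
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_length).input_ids.to(model.device)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the shifted LM loss.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("雨声、冰箱偶尔的嗡鸣,全都被放大。"))  # sample text; lower is better
```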

The evaluation results demonstrate that XVERSE-Ent achieves excellent performance on entertainment tasks such as fiction writing and conversational generation, while maintaining strong general capabilities:

  • Performance on general benchmarks (e.g., MMLU, mathematics, and code) shows minimal degradation
  • Overall general capability retention exceeds 98%

These results confirm that XVERSE-Ent effectively enhances entertainment-domain performance without sacrificing general-purpose reasoning ability.

| Perplexity | XVERSE-MoE-A4.2B (General-domain) | XVERSE-Ent-A4.2B (Entertainment-domain) |
| --- | --- | --- |
| XfictionEN | 1.7374 | 1.7516 |
| XfictionZH | 1.7315 | 1.5368 |
| XwebccEN | 1.5519 | 1.6008 |
| XwebccZH | 1.5646 | 1.6861 |
| XconversationEN | 1.3918 | 1.3517 |
| XconversationZH | 1.413 | 1.3353 |
| AVG(all) | 1.5650 | 1.5437 |
| AVG(fiction + conversation) | 1.5684 | 1.4939 |
| Perplexity | Dense-Base | Coarse-grained Upcycling (General-domain) | Coarse-grained Upcycling (Entertainment-domain) | Fine-grained Upcycling (General-domain) | XVERSE-Ent-A5.7B (Fine-grained Upcycling, Entertainment-domain) |
| --- | --- | --- | --- | --- | --- |
| XfictionEN | 2.5991 | 2.6089 | 2.3740 | 2.5943 | 2.3620 |
| XfictionZH | 2.7344 | 2.2565 | 2.1723 | 2.2437 | 2.1528 |
| XwebccEN | 2.5475 | 2.4711 | 2.4557 | 2.4485 | 2.4283 |
| XwebccZH | 2.9299 | 2.0791 | 2.1116 | 2.0651 | 2.0905 |
| XconversationEN | 2.1443 | 2.0892 | 1.9439 | 2.0711 | 1.9194 |
| XconversationZH | 2.3478 | 1.8610 | 1.8513 | 1.8249 | 1.8045 |
| AVG(all) | 2.5505 | 2.2276 | 2.1515 | 2.2079 | 2.1263 |
| AVG(fiction + conversation) | 2.4564 | 2.2039 | 2.0854 | 2.1835 | 2.0597 |

Usage

Loading with Transformers

The XVERSE-Ent-A4.2B model can be loaded for inference using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model (bfloat16 weights, automatically placed on available GPUs).
tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-Ent-A4.2B")
model = AutoModelForCausalLM.from_pretrained("xverse/XVERSE-Ent-A4.2B", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
model = model.eval()

# Example prompt: a Chinese fiction passage ("Time passed second by second. The rain, the occasional hum of the
# refrigerator, the water in unseen pipes were all amplified. Lin Yu realized he was counting his breaths...").
inputs = tokenizer('时间一分一秒地过去。雨声、冰箱偶尔的嗡鸣、墙壁里不知名管道的水流声,全都被放大。林屿意识到自己在数呼吸,仿佛只要停下来,房间里就会多出一个不属于他的存在。', return_tensors='pt').input_ids
inputs = inputs.cuda()

# Generate a continuation of the prompt and decode it back to text.
generated_ids = model.generate(inputs, max_new_tokens=70, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.1)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))

Limitations and Disclaimer

Like all other Large Language Models (LLMs), XVERSE-Ent series models may produce inaccurate, biased, or otherwise offensive content under certain circumstances. Therefore, please use the model-generated content with caution and refrain from disseminating harmful content. Before deploying any application based on the XVERSE-Ent series models, developers should conduct safety tests and optimize the model for its specific application.

We strongly discourage the use of the XVERSE-Ent series models for producing or disseminating harmful information, or for conducting any activities that might harm the public, national security, or social security, or violate regulations. We assume no responsibility for any problems arising from the use of the XVERSE-Ent series models, including data security issues, public opinion risks, or risks arising from misunderstanding, misuse, dissemination, or non-compliant use of the model.

Open Source License

The use of the source code in this repository must follow the Apache-2.0 open-source license, while the use of the model weights of XVERSE-Ent series models needs to adhere to the Model License Agreement.

The weights of XVERSE-Ent series models are fully open to academic research and support unrestricted commercial use. For other questions or collaborations, please contact [email protected].
