XVERSE-Ent-A4.2B
Model Introduction
The XVERSE-Ent series is a family of domain-specific, pretrained models developed by XVERSE (Shenzhen Yuanxiang) for the Entertainment domain. These models are trained and optimized on XVERSE’s large-scale, high-quality entertainment-domain data, thereby filling the gap for open-source large language models in the entertainment domain.
XVERSE-Ent includes both Chinese and English models:
- XVERSE-Ent-A4.2B (Chinese)
- XVERSE-Ent-A5.7B (English)
Both models adopt a Mixture-of-Experts (MoE) architecture. Detailed technical information is provided below.
| Model | XVERSE-Ent-A4.2B | XVERSE-Ent-A5.7B |
|---|---|---|
| Language | Chinese | English |
| Training Recipe | Multi-stage Training | Fine-grained Upcycling + Multi-stage Training |
| Total Parameters | 25B | 36B |
| Activated Parameters | 4.2B | 5.7B |
| Number of Layers | 28 | 32 |
| Hidden Dimension | 2560 | 3072 |
| Number of Attention Heads | 32 | 32 |
| Number of Shared Experts | 2 | 2 |
| Number of Non-Shared Experts | 64 | 64 |
| Selected Experts per Token | 8 | 8 |
| Vocabulary Size | 100K | 128K |
| Context Length | 8K | 8K |
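The expert configuration in the table above (2 shared experts always active, top-8 of 64 routed experts per token) can be sketched as follows. This is a minimal, illustrative routing loop in NumPy, not the actual XVERSE implementation; the toy experts are simple linear maps standing in for FFN experts, and all sizes are reduced for readability.

```python
import numpy as np

def moe_forward(x, shared_experts, routed_experts, router_w, top_k=8):
    """Route one token through a simplified MoE layer.

    Shared experts always process the token; the router selects the
    top_k routed (non-shared) experts and mixes their outputs with
    softmax-normalized gate weights over the selected experts only.
    """
    logits = router_w @ x                       # (num_routed_experts,)
    top = np.argsort(logits)[-top_k:]           # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # renormalize over selected experts

    out = sum(e(x) for e in shared_experts)     # shared experts: always active
    for g, idx in zip(gates, top):
        out = out + g * routed_experts[idx](x)  # sparse, gated combination
    return out

rng = np.random.default_rng(0)
d = 16
# Toy expert: a small linear map (hypothetical stand-in for an FFN expert).
make_expert = lambda: (lambda W: (lambda v: W @ v))(rng.normal(size=(d, d)) * 0.1)

shared = [make_expert() for _ in range(2)]      # 2 shared experts
routed = [make_expert() for _ in range(64)]     # 64 non-shared experts
router = rng.normal(size=(64, d))

y = moe_forward(rng.normal(size=d), shared, routed, router, top_k=8)
print(y.shape)  # (16,)
```

Only 8 of the 64 routed experts contribute to each token, which is why the activated parameter count (4.2B / 5.7B) is far below the total (25B / 36B).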
Technical Overview
XVERSE-Ent leverages Sparse Upcycling to convert a dense model into a large-scale MoE model. Combined with a carefully designed multi-stage training strategy, the models substantially enhance domain-specific capabilities while preserving most general-purpose abilities.
Sparse Upcycling
Sparse Upcycling is a technique that transforms a pretrained dense model into a MoE model without training from scratch. This approach enables a significant increase in total model capacity while substantially reducing training cost and time.
The upcycling process consists of two main steps:
Fine-grained FFN Decomposition
The Feed-Forward Network (FFN) layers of the dense model are decomposed into multiple smaller sub-networks. Each sub-network is treated as an independent expert in the MoE model. To better accommodate inference-time GPU memory constraints, expert sub-networks can be replicated as needed, enabling flexible adaptation to different hardware configurations.
Attention Reuse
The attention layers of the original dense model are preserved and directly reused in the MoE model. This design choice maximizes the retention of the original model’s general-purpose capabilities and ensures training stability during architectural transformation.
An illustration of the fine-grained FFN decomposition is shown below. In this example, a single FFN is split into two sub-networks, each serving as a separate expert.
In contrast, when the FFN is not decomposed and the entire FFN is treated as a single expert, we refer to the approach as coarse-grained decomposition; it is illustrated below.
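The fine-grained decomposition described above can be demonstrated with a short NumPy sketch. Assuming a standard two-matrix FFN with an elementwise ReLU (an illustrative simplification; the actual architecture may use gated activations), slicing the hidden (d_ff) dimension into chunks yields experts whose summed outputs reproduce the dense FFN exactly, because the nonlinearity acts independently on each hidden unit:

```python
import numpy as np

def dense_ffn(x, w_up, w_down):
    # Standard FFN: down-projection of ReLU(up-projection).
    return w_down @ np.maximum(w_up @ x, 0.0)

def split_ffn(w_up, w_down, n_experts):
    """Fine-grained decomposition: slice the hidden (d_ff) dimension
    into n_experts contiguous chunks, one chunk per expert."""
    chunk = w_up.shape[0] // n_experts
    return [(w_up[i * chunk:(i + 1) * chunk],
             w_down[:, i * chunk:(i + 1) * chunk])
            for i in range(n_experts)]

rng = np.random.default_rng(0)
d, d_ff = 8, 32
w_up = rng.normal(size=(d_ff, d))
w_down = rng.normal(size=(d, d_ff))
x = rng.normal(size=d)

experts = split_ffn(w_up, w_down, n_experts=4)
# With every expert active, the sum of expert outputs reproduces the
# dense FFN exactly, since ReLU acts elementwise on the hidden units.
recombined = sum(wd @ np.maximum(wu @ x, 0.0) for wu, wd in experts)
print(np.allclose(recombined, dense_ffn(x, w_up, w_down)))  # True
```

In the upcycled MoE, of course, a router activates only a subset of these experts per token; the exact-recovery property above is what makes upcycled experts a good initialization rather than a final model.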
Multi-Stage Training Strategy
XVERSE-Ent adopts a three-stage training pipeline to build models optimized for specific languages and domains:
- S0: Capability Reconstruction – recovering general-purpose capabilities after architectural transformation
- S1: Language Enhancement – enhancing the model’s modeling capability for the target language
- S2: Domain Enhancement – enhancing the model’s generation and understanding abilities in the entertainment domain
The first two stages use general-domain data, while the final stage uses a mixture of general-domain and entertainment-domain data. This multi-stage design maximizes retention of general capabilities while significantly improving domain-specific performance.
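The staged data mixture can be expressed as a small configuration sketch. Stage names follow the text above, but the mixture ratios and the 50/50 split in S2 are purely illustrative assumptions, not the published XVERSE recipe:

```python
# Hypothetical three-stage schedule; ratios are illustrative only.
stages = [
    {"name": "S0: Capability Reconstruction",
     "data_mix": {"general": 1.0}},
    {"name": "S1: Language Enhancement",
     "data_mix": {"general": 1.0}},  # general data, weighted toward the target language
    {"name": "S2: Domain Enhancement",
     "data_mix": {"general": 0.5, "entertainment": 0.5}},  # assumed split
]

for stage in stages:
    # Each stage's mixture weights must sum to 1.
    assert abs(sum(stage["data_mix"].values()) - 1.0) < 1e-9
    print(stage["name"], stage["data_mix"])
```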
XVERSE-Ent-A4.2B (Chinese) is obtained by applying S2 domain-enhancement training on the general-domain backbone XVERSE-MoE-A4.2B.
XVERSE-Ent-A5.7B (English) is built by transforming a general dense backbone into a MoE model via fine-grained MoE upcycling, followed by the full multi-stage training pipeline.
Both the Chinese and English models support an 8K context window and are trained on ~1T tokens.
Model Evaluation
To evaluate domain-specific performance, we constructed multiple evaluation datasets across different domains:
- fiction: novel and story-oriented texts
- conversation: dialogue-oriented texts
- webcc: general web text
The evaluation metric is Perplexity (PPL), where lower values indicate better performance.
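Perplexity is the exponential of the mean per-token negative log-likelihood, which is why values close to 1 indicate a model that assigns high probability to the evaluation text. A minimal illustration with made-up per-token losses:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    Lower is better: the model is less 'surprised' by the text."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy example: per-token negative log-likelihoods (in nats) on a short span.
nlls = [0.40, 0.55, 0.30, 0.50]
print(round(perplexity(nlls), 4))  # ≈ 1.5488
```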
The evaluation results demonstrate that XVERSE-Ent achieves excellent performance on entertainment tasks such as fiction writing and conversational generation, while maintaining strong general capabilities:
- Performance on general benchmarks (e.g., MMLU, mathematics, and code) shows minimal degradation
- Overall general capability retention exceeds 98%
These results confirm that XVERSE-Ent effectively enhances entertainment-domain performance without sacrificing general-purpose reasoning ability.
| Perplexity | XVERSE-MoE-A4.2B (General-domain) | XVERSE-Ent-A4.2B (Entertainment-domain) |
|---|---|---|
| XfictionEN | 1.7374 | 1.7516 |
| XfictionZH | 1.7315 | 1.5368 |
| XwebccEN | 1.5519 | 1.6008 |
| XwebccZH | 1.5646 | 1.6861 |
| XconversationEN | 1.3918 | 1.3517 |
| XconversationZH | 1.4130 | 1.3353 |
| AVG(all) | 1.5650 | 1.5437 |
| AVG(fiction + conversation) | 1.5684 | 1.4939 |
| Perplexity | Dense-Base | Coarse-grained Upcycling (General-domain) | Coarse-grained Upcycling (Entertainment-domain) | Fine-grained Upcycling (General-domain) | XVERSE-Ent-A5.7B (Fine-grained Upcycling, Entertainment-domain) |
|---|---|---|---|---|---|
| XfictionEN | 2.5991 | 2.6089 | 2.3740 | 2.5943 | 2.3620 |
| XfictionZH | 2.7344 | 2.2565 | 2.1723 | 2.2437 | 2.1528 |
| XwebccEN | 2.5475 | 2.4711 | 2.4557 | 2.4485 | 2.4283 |
| XwebccZH | 2.9299 | 2.0791 | 2.1116 | 2.0651 | 2.0905 |
| XconversationEN | 2.1443 | 2.0892 | 1.9439 | 2.0711 | 1.9194 |
| XconversationZH | 2.3478 | 1.8610 | 1.8513 | 1.8249 | 1.8045 |
| AVG(all) | 2.5505 | 2.2276 | 2.1515 | 2.2079 | 2.1263 |
| AVG(fiction + conversation) | 2.4564 | 2.2039 | 2.0854 | 2.1835 | 2.0597 |
Usage
Loading with Transformers
The XVERSE-Ent-A4.2B model can be loaded for inference using the following code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model. trust_remote_code is required for the
# custom MoE architecture; device_map='auto' places weights across
# available GPUs automatically.
tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-Ent-A4.2B")
model = AutoModelForCausalLM.from_pretrained(
    "xverse/XVERSE-Ent-A4.2B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = model.eval()

# A Chinese fiction passage used as the prompt for continuation.
inputs = tokenizer('时间一分一秒地过去。雨声、冰箱偶尔的嗡鸣、墙壁里不知名管道的水流声,全都被放大。林屿意识到自己在数呼吸,仿佛只要停下来,房间里就会多出一个不属于他的存在。', return_tensors='pt').input_ids
inputs = inputs.cuda()

generated_ids = model.generate(
    inputs,
    max_new_tokens=70,
    eos_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```
Limitations and Disclaimer
Like all other Large Language Models (LLMs), XVERSE-Ent series models may produce inaccurate, biased, or otherwise offensive content under certain circumstances. Therefore, please use the model-generated content with caution and refrain from disseminating harmful content. Before deploying any application based on the XVERSE-Ent series models, developers should conduct safety tests and optimize the model for its specific application.
We strongly discourage the use of the XVERSE-Ent series models for producing or disseminating harmful information, or for conducting any activities that might harm the public, national security, or social security, or violate regulations. We assume no responsibility for any problems arising from the use of the XVERSE-Ent series models, including data security issues, public opinion risks, or risks arising from misunderstanding, misuse, dissemination, or non-compliance with the model.
Open Source License
The use of the source code in this repository must follow the Apache-2.0 open-source license, while the use of the model weights of XVERSE-Ent series models needs to adhere to the Model License Agreement.
The weights of XVERSE-Ent series models are fully open to academic research and support unrestricted commercial use. For other questions or collaborations, please contact [email protected].
Base model: xverse/XVERSE-MoE-A4.2B