# German MoE GPT v8 - OPUS EDITION

A research-grade language model with a state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.

> **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.

## Model Description

This is a 149.6M-parameter Mixture-of-Experts (MoE) language model trained on high-quality German text. The model uses a hybrid architecture that combines dense and sparse (MoE) layers for optimal parameter efficiency.

### Key Features

- 🏗️ **Hybrid Dense + MoE Architecture:** Every 2nd layer uses MoE for efficiency (see the sketch below)
- 🔬 **Research-Backed:** Implements ST-MoE and Switch Transformer best practices
- ⚡ **Efficient:** Only ~33% of parameters are active per token
- 🖥️ **Cross-Platform:** Pure PyTorch, runs on Windows/Linux/macOS
- 🤗 **HuggingFace Compatible:** Full integration with the `transformers` library
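
The alternating layout can be pictured as a simple stack builder. The following is a minimal sketch under stated assumptions: `DenseBlock` and `MoEBlock` are illustrative placeholders (not the repository's actual classes), and the 4x FFN expansion is assumed.

```python
import torch.nn as nn

class DenseBlock(nn.Module):
    """Placeholder for a standard transformer block with a dense FFN."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

class MoEBlock(nn.Module):
    """Placeholder for a block whose FFN is a bank of 32 experts (top-2 routed)."""
    def __init__(self, hidden: int = 768, n_experts: int = 32):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )

def build_blocks(n_layers: int = 12) -> nn.ModuleList:
    # Every 2nd layer is sparse (MoE); the rest stay dense.
    return nn.ModuleList(
        MoEBlock() if i % 2 == 1 else DenseBlock() for i in range(n_layers)
    )
```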

## Model Specifications

| Specification | Value |
|---------------|-------|
| Total Parameters | 149.6M |
| Active Parameters per Token | ~49.9M (~33%) |
| Vocabulary Size | 128,256 (Llama 3.2 tokenizer) |
| Context Length | 2048 tokens |
| Architecture | Hybrid Dense + MoE Transformer |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Experts per MoE Layer | 32 |
| Active Experts (Top-k) | 2 |
| Position Embeddings | RoPE (Rotary Position Embeddings) |
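
These specifications map onto a configuration object roughly like the sketch below; the dataclass and its field names are illustrative assumptions, not the model's actual config class.

```python
from dataclasses import dataclass

@dataclass
class MoEGPTConfig:
    vocab_size: int = 128_256          # Llama 3.2 tokenizer
    max_position_embeddings: int = 2048
    num_hidden_layers: int = 12        # alternating dense / MoE
    hidden_size: int = 768
    num_attention_heads: int = 12
    num_experts: int = 32              # per MoE layer
    num_experts_per_tok: int = 2       # top-k routing
    use_rope: bool = True              # rotary position embeddings
```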

## Training Data

The model was trained on a 17.4 GB curated German corpus consisting of:

- **Clean German Wikipedia** (~11 GB): Encyclopedic knowledge
- **OpenSubtitles (German)**: Natural dialog and conversational language
- **Belletristik** (fiction): German literature for style and creativity

**Data Quality:** Deduplicated and filtered for SEO spam to keep the training signal high quality.

> **Adapting to other languages:** The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.

## Training Details

### Training Hyperparameters

- **Steps:** 300,000
- **Batch Size:** 32 (with gradient accumulation; see the sketch below)
- **Learning Rate:** 3e-4 (max)
- **Hardware:** Single RTX 4090 (24 GB VRAM)
- **Training Time:** ~120 hours
- **Precision:** Mixed (BF16)
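
Batch size 32 with gradient accumulation and BF16 mixed precision corresponds to a training step roughly like this sketch; the `accum_steps` value, the AdamW optimizer, and the HuggingFace-style `model(...).loss` interface are assumptions for illustration.

```python
import torch

def train_loop(model, train_loader, accum_steps: int = 8, lr: float = 3e-4):
    """Gradient accumulation with BF16 autocast (illustrative, not the actual training script)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    optimizer.zero_grad(set_to_none=True)
    for step, (input_ids, labels) in enumerate(train_loader):
        # BF16 mixed precision on CUDA
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(input_ids, labels=labels).loss / accum_steps
        loss.backward()                    # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:  # one optimizer update per effective batch
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```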

### Results

| Metric | Initial | Final | Improvement |
|--------|---------|-------|-------------|
| Training Loss | 12.0 | 2.55 | 79% ↓ |
| Validation Loss | 4.58 | 2.40 | 48% ↓ |
| Perplexity | - | 11.0 | - |

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")

# Generate text
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Advanced Usage

```python
# Generate with custom parameters (reuses `model` and `inputs` from the Quick Start example)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,         # Lower = more deterministic
    top_k=40,                # Top-k sampling
    top_p=0.95,              # Nucleus sampling
    repetition_penalty=1.1,  # Reduce repetition
    do_sample=True
)
```

## Technical Architecture

### MoE Layer Design

The model uses a **Noisy Top-k Router** with the following components (a minimal code sketch follows the list):

1. **Gate Computation:** Learned routing weights per expert
2. **Noise Injection:** Adds controlled noise during training for exploration
3. **Top-k Selection:** Routes each token to its 2 best experts
4. **Capacity Management:** Prevents expert overload with dynamic capacity limits
5. **Load Balancing:** An auxiliary loss encourages uniform expert utilization
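
A minimal sketch of steps 1-3 above (gating, noise injection, top-k selection), following the description in this section; capacity management and the auxiliary losses are handled elsewhere, and the class below is illustrative rather than the repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, hidden: int = 768, n_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.w_gate = nn.Linear(hidden, n_experts, bias=False)   # gate logits
        self.w_noise = nn.Linear(hidden, n_experts, bias=False)  # learned noise scale

    def forward(self, hidden_states: torch.Tensor):
        # 1. Gate computation
        logits = self.w_gate(hidden_states)
        # 2. Noise injection (training only): scaled Gaussian noise for exploration
        if self.training:
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(hidden_states))
        # 3. Top-k selection: each token picks its 2 best experts
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # routing weights for the chosen experts
        return top_idx, weights, logits         # logits also feed the aux / z losses

router = NoisyTopKRouter()
idx, w, logits = router(torch.randn(1, 16, 768))  # (batch, seq, hidden)
```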

### Loss Functions

The training loss combines three components (the auxiliary terms are sketched below):

```
L_total = L_ce + α * L_aux + β * L_z
```

- **L_ce:** Cross-entropy language modeling loss
- **L_aux:** Load-balance loss (α = 0.01) for uniform expert utilization
- **L_z:** Router z-loss (β = 0.001) for numerical stability
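
The two auxiliary terms can be sketched as follows, using the standard formulations from the cited papers (load-balance loss from Switch Transformer, router z-loss from ST-MoE); function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def moe_aux_losses(router_logits: torch.Tensor, top_idx: torch.Tensor, n_experts: int = 32):
    """router_logits: (tokens, n_experts); top_idx: (tokens, k) indices of chosen experts."""
    probs = F.softmax(router_logits, dim=-1)

    # Load-balance loss: fraction of routing assignments per expert
    # times the mean router probability per expert, scaled by n_experts.
    token_frac = router_logits.new_zeros(n_experts).scatter_add_(
        0, top_idx.reshape(-1), router_logits.new_ones(top_idx.numel()))
    token_frac = token_frac / top_idx.numel()
    prob_frac = probs.mean(dim=0)
    l_aux = n_experts * torch.sum(token_frac * prob_frac)

    # Router z-loss: penalizes large gate logits for numerical stability.
    l_z = torch.logsumexp(router_logits, dim=-1).pow(2).mean()

    return l_aux, l_z

# L_total = L_ce + 0.01 * l_aux + 0.001 * l_z
```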

### Attention Mechanism

- **RoPE (Rotary Position Embeddings)** for position encoding
- **PyTorch SDPA** with automatic backend selection (Flash Attention when available), as sketched below
- **Causal masking** for autoregressive generation
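
The attention path, RoPE applied to queries and keys followed by PyTorch SDPA with causal masking, can be sketched as below; the interleaved-pair RoPE shown here is one common variant and is an assumption, not necessarily the exact rotation scheme used in the repository.

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by position-dependent angles.
    x: (batch, heads, seq_len, head_dim) with an even head_dim."""
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]            # interleaved channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# 12 heads with head_dim 64 (= 768 / 12); SDPA picks Flash/efficient kernels automatically
q = apply_rope(torch.randn(1, 12, 16, 64))
k = apply_rope(torch.randn(1, 12, 16, 64))
v = torch.randn(1, 12, 16, 64)
attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal masking
```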

### Optimizations

- ✅ **Gradient Checkpointing:** ~40% VRAM reduction
- ✅ **Mixed Precision (BF16):** 2x faster training
- ✅ **Weight Tying:** The LM head shares weights with the token embeddings (see the sketch below)
- ✅ **Batch Expert Processing:** Parallel computation across all experts
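
Weight tying and gradient checkpointing look roughly like this sketch; the single `nn.Sequential` block stands in for a full transformer block, and the 4x FFN width is assumed.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

vocab_size, hidden = 128_256, 768

tok_emb = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = tok_emb.weight   # weight tying: one shared (vocab, hidden) matrix

# Stand-in for a transformer block; checkpointing recomputes its activations on backward
block = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

x = tok_emb(torch.randint(0, vocab_size, (1, 16)))
y = checkpoint(block, x, use_reentrant=False)   # trades recompute for activation memory
logits = lm_head(y)
```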

## Limitations and Biases

- **Language:** Primarily trained on German text
- **Domain:** General domain (Wikipedia, literature, subtitles)
- **Biases:** May reflect biases present in the training data
- **Context:** Limited to 2048 tokens
- **Compute:** Requires a GPU for efficient inference

## Ethical Considerations

As a language model, this model can generate text that may be:

- Factually incorrect
- Biased or stereotypical
- Inappropriate or offensive

Users should:

- Verify generated content for factual accuracy
- Be aware of potential biases
- Use appropriate content filtering for production applications

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{german-moe-gpt-v8,
  title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
}
```

## References

This implementation is based on:

- **ST-MoE:** Zoph et al. (2022) - [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906)
- **Switch Transformer:** Fedus et al. (2022) - [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- **RoFormer:** Su et al. (2021) - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)

## License

MIT License - See LICENSE file for details

## Acknowledgments

- HuggingFace Transformers team for the excellent framework
- PyTorch team for SDPA and optimized operations
- nanoGPT/nanoMoE community for inspiration

## Model Card Contact

For questions or feedback, please open an issue in the [GitHub repository](https://github.com/accemlcc/german-moe-gpt-v8).