German MoE GPT v8 - OPUS EDITION
A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
Note: While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
Model Description
This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture that combines dense and sparse (MoE) layers for parameter efficiency.
Key Features
- 🏗️ Hybrid Dense + MoE Architecture: Every 2nd layer uses MoE for efficiency (see the sketch after this list)
- 🔬 Research-Backed: Implements ST-MoE and Switch Transformer best practices
- ⚡ Efficient: Only ~33% of parameters active per token
- 🖥️ Cross-Platform: Pure PyTorch, runs on Windows/Linux/macOS
- 🤗 HuggingFace Compatible: Full integration with the `transformers` library
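As a quick illustration of the hybrid layout: with 12 layers and an "every 2nd layer" scheme, 6 layers carry an MoE block. The exact offset (odd vs. even indices) is an assumption here, not stated in this card.

```python
# Illustrative only: which of the 12 layers would carry an MoE block under
# an "every 2nd layer" scheme (the odd-index offset is an assumption).
moe_layer_ids = [i for i in range(12) if i % 2 == 1]
print(moe_layer_ids)  # [1, 3, 5, 7, 9, 11] -> 6 of 12 layers are MoE
```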
Model Specifications
| Specification | Value |
|---|---|
| Total Parameters | 149.6M |
| Active Parameters per Token | ~33% of total (≈50M) |
| Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
| Context Length | 2048 tokens |
| Architecture | Hybrid Dense + MoE Transformer |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Experts per MoE Layer | 32 |
| Active Experts (Top-k) | 2 |
| Position Embeddings | RoPE (Rotary Position Embeddings) |
Training Data
The model was trained on a 17.4 GB curated German corpus consisting of:
- Clean German Wikipedia (~11 GB): Encyclopedic knowledge
- OpenSubtitles (German): Natural dialog and conversational language
- Belletristik (fiction): German literature for style and creativity
Data Quality: Deduplicated and filtered for SEO spam to provide a high-quality training signal.
Adapting to other languages: The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.
Training Details
Training Hyperparameters
- Steps: 300,000
- Batch Size: 32 (with gradient accumulation)
- Learning Rate: 3e-4 (max)
- Hardware: Single RTX 4090 (24GB VRAM)
- Training Time: ~120 hours
- Precision: Mixed (BF16)
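The card lists the key values but not the optimizer or schedule. Purely as orientation, here is a hypothetical training-loop skeleton (AdamW, one-cycle warmup/decay to the 3e-4 peak, BF16 autocast, gradient accumulation); everything beyond the listed values is an assumption.

```python
import torch

# Hypothetical skeleton matching the listed hyperparameters. Optimizer,
# warmup fraction, weight decay, and accumulation factor are assumptions --
# the card only states steps, batch size, peak LR, hardware, and precision.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 768).to(device)          # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, total_steps=300_000, pct_start=0.01
)

accum_steps = 4                                        # gradient accumulation (assumed)
for step in range(1):                                  # one illustrative optimizer step
    for _ in range(accum_steps):
        x = torch.randn(8, 768, device=device)
        with torch.autocast(device, dtype=torch.bfloat16):
            loss = model(x).pow(2).mean() / accum_steps
        loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```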
Results
| Metric | Initial | Final | Improvement |
|---|---|---|---|
| Training Loss | 12.0 | 2.55 | 79% ↓ |
| Validation Loss | 4.58 | 2.40 | 48% ↓ |
| Perplexity | - | 11.0 | - |
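The reported perplexity is simply the exponential of the final validation loss:

```python
import math
print(math.exp(2.40))  # ≈ 11.02, matching the reported perplexity of 11.0
```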
Usage
Installation
pip install transformers torch
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
# Generate text
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.8,
top_k=50,
top_p=0.9,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advanced Usage
# Generate with custom parameters
outputs = model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7, # Lower = more deterministic
top_k=40, # Top-k sampling
top_p=0.95, # Nucleus sampling
repetition_penalty=1.1, # Reduce repetition
do_sample=True
)
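For faster GPU inference, the checkpoint can also be loaded in BF16 and moved to the device. The dtype choice below mirrors the BF16 training precision and is a suggestion, not a requirement of this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Optional: BF16 GPU inference (dtype/device choices are suggestions).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "arnomatic/german-moe-gpt-v8-pretrained",
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)
tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")

inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```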
Technical Architecture
MoE Layer Design
The model uses a Noisy Top-k Router with the following components:
- Gate Computation: Learned routing weights per expert
- Noise Injection: Adds controlled noise during training for exploration
- Top-k Selection: Routes each token to the 2 best experts
- Capacity Management: Prevents expert overload with dynamic capacity limits
- Load Balancing: Auxiliary loss ensures uniform expert utilization
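To make the routing concrete, here is a minimal, self-contained sketch of a noisy top-k router (capacity limiting omitted for brevity). Class and variable names are illustrative, not the checkpoint's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Minimal sketch: route each token to the top-k experts."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)   # routing weights
        self.noise = nn.Linear(hidden_size, num_experts, bias=False)  # learned noise scale
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                                  # (tokens, experts)
        if self.training:                                      # exploration noise during training
            noise_std = F.softplus(self.noise(x))
            logits = logits + torch.randn_like(logits) * noise_std
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)    # pick the 2 best experts
        weights = top_vals.softmax(dim=-1)                     # per-token combine weights
        return weights, top_idx, logits

router = NoisyTopKRouter(hidden_size=768, num_experts=32, top_k=2)
weights, expert_ids, logits = router(torch.randn(4, 768))
print(expert_ids.shape)  # (4, 2): two experts chosen per token
```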
Loss Functions
The training loss combines three components:
L_total = L_ce + α * L_aux + β * L_z
- L_ce: Cross-entropy language modeling loss
- L_aux: Load balance loss (α = 0.01) for uniform expert utilization
- L_z: Router z-loss (β = 0.001) for numerical stability
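The two auxiliary terms can be sketched from the router outputs as follows, following the general recipes from Switch Transformer (load balance) and ST-MoE (z-loss). This is an illustrative implementation, not the checkpoint's exact code; with the router sketch above, `router_logits` and `expert_ids` would come straight from its outputs.

```python
import torch

def load_balance_loss(router_logits: torch.Tensor, expert_ids: torch.Tensor,
                      num_experts: int) -> torch.Tensor:
    # Switch-Transformer-style balance loss: fraction of tokens dispatched
    # to each expert times the mean router probability for that expert.
    probs = router_logits.softmax(dim=-1)                     # (tokens, experts)
    dispatch = torch.zeros_like(probs).scatter_(1, expert_ids, 1.0)
    tokens_per_expert = dispatch.mean(dim=0)                  # dispatch fraction per expert
    prob_per_expert = probs.mean(dim=0)                       # mean routing probability
    return num_experts * (tokens_per_expert * prob_per_expert).sum()

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    # ST-MoE z-loss: penalizes large router logits for numerical stability.
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

# total = ce_loss + 0.01 * load_balance_loss(...) + 0.001 * router_z_loss(...)
```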
Attention Mechanism
- RoPE (Rotary Position Embeddings) for position encoding
- PyTorch SDPA with automatic backend selection (Flash Attention when available)
- Causal masking for autoregressive generation
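A compact illustration of that path: rotary embeddings applied to queries and keys, then `torch.nn.functional.scaled_dot_product_attention` with causal masking. The RoPE helper and tensor shapes are illustrative, not the model's actual code.

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); rotate channel pairs by
    # position-dependent angles (standard RoPE, illustrative version).
    b, h, t, d = x.shape
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = k = v = torch.randn(1, 12, 16, 64)      # (batch, heads=12, seq, head_dim=768/12)
q, k = apply_rope(q), apply_rope(k)
# SDPA selects the best backend (Flash Attention when available) automatically.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([1, 12, 16, 64])
```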
Optimizations
- ✅ Gradient Checkpointing: ~40% VRAM reduction
- ✅ Mixed Precision (BF16): 2x faster training
- ✅ Weight Tying: LM head shares embeddings
- ✅ Batch Expert Processing: Parallel computation for all experts
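Two of these optimizations are easy to show in isolation; this is a generic sketch, not the model's own code. The batched expert computation is specific to the MoE implementation and is not shown here.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Weight tying: the LM head reuses the token-embedding matrix,
# saving vocab_size x hidden parameters.
vocab_size, hidden = 128_256, 768
tok_emb = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = tok_emb.weight                      # shared storage

# Gradient checkpointing: recompute a block's activations during backward
# instead of storing them, trading compute for VRAM.
block = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
x = torch.randn(2, hidden, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```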
Limitations and Biases
- Language: Primarily trained on German text
- Domain: General domain (Wikipedia, literature, subtitles)
- Biases: May reflect biases present in training data
- Context: Limited to 2048 tokens
- Compute: Requires GPU for efficient inference
Ethical Considerations
Like any language model, this model can generate text that may be:
- Factually incorrect
- Biased or stereotypical
- Inappropriate or offensive
Users should:
- Verify generated content for factual accuracy
- Be aware of potential biases
- Use appropriate content filtering for production applications
Citation
If you use this model in your research, please cite:
@misc{german-moe-gpt-v8,
title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
author={[Your Name]},
year={2025},
howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
}
References
This implementation is based on:
- ST-MoE: Zoph et al. (2022) - Designing Effective Sparse Expert Models
- Switch Transformer: Fedus et al. (2022) - Switch Transformers: Scaling to Trillion Parameter Models
- RoFormer: Su et al. (2021) - RoFormer: Enhanced Transformer with Rotary Position Embedding
License
MIT License - See LICENSE file for details
Acknowledgments
- HuggingFace Transformers team for the excellent framework
- PyTorch team for SDPA and optimized operations
- nanoGPT/nanoMoE community for inspiration
Model Card Contact
For questions or feedback, please open an issue in the GitHub repository.