arnomatic committed
Commit 8d3e75f · verified · 1 Parent(s): be8d68c

Upload README.md

Files changed (1)
  1. README.md +203 -254
README.md CHANGED
@@ -1,254 +1,203 @@
1
- # German MoE GPT v8 - OPUS EDITION
2
-
3
- A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
4
-
5
- > **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
6
-
7
- ## Project Status (October 2025)
8
-
9
- - **v8 Pre-Training:** **COMPLETE**
10
- - **Fine-Tuning Phase:** 🔬 **IN PROGRESS**
11
-
12
- ## Overview
13
-
14
- This project develops a high-performance language model with a state-of-the-art MoE architecture on a single consumer GPU. The v8 model was trained on a 17.4 GB high-quality German corpus and generates coherent text free of SEO-spam artifacts.
15
-
16
- ## 🏗️ Architecture
17
-
18
- ### Model Specifications
19
-
20
- - **Total Parameters:** 149.6M
21
- - **Active Parameters per Token:** ~49.9M (~33%)
22
- - **Architecture:** Hybrid Dense + MoE Transformer
23
- - **Experts per MoE Layer:** 32
24
- - **Active Experts (Top-k):** 2
25
- - **Context Length:** 2048 Tokens
26
- - **Vocabulary:** 128,256 (Llama 3.2 Tokenizer)
27
-
28
- ### Core Components
29
-
30
- #### 1. **Mixture-of-Experts Layer**
31
- - **Noisy Top-k Router** with learnable gating mechanism
32
- - **Dynamic Expert Capacity Management** to prevent token overflow
33
- - **Load Balance Loss** (Switch Transformer) for uniform expert utilization
34
- - **Router Z-Loss** (ST-MoE) for numerical stability
35
- - **FP32 Router Computation** to avoid precision issues
36
-
37
- #### 2. **Attention Mechanism**
38
- - **Rotary Position Embeddings (RoPE)** instead of classical positional encodings
39
- - **PyTorch SDPA** (Scaled Dot Product Attention) with automatic backend selection
40
- - **Causal Masking** for autoregressive generation
41
- - **Multi-Head Self-Attention** with 12 heads
42
-
43
- #### 3. **Expert Architecture**
44
- - **Batch Matrix Multiplication** for parallel expert processing
45
- - **SwiGLU Activation** (optional, alongside GELU/ReLU)
46
- - **4x Hidden Dimension** (standard for GPT architecture)
47
- - **Shared Expert Weights** as 3D tensors for efficiency
48
-
49
- #### 4. **HuggingFace Integration**
50
- - Fully compatible with `transformers` library
51
- - Inherits from `PreTrainedModel` and `GenerationMixin`
52
- - Supports `.generate()` for inference
53
- - **Weight Tying** between token embeddings and LM head
54
- - **Gradient Checkpointing** support for memory efficiency
55
-
56
- ### Technical Features
57
-
58
- #### 🔬 **Research-Backed Design**
59
- - Implementation based on **ST-MoE** (Zoph et al. 2022) and **Switch Transformer** (Fedus et al. 2022)
60
- - Auxiliary loss functions for stable MoE training
61
- - Capacity factor management (1.25 training, 2.0 evaluation)
62
- - Expert-specific initialization with fan-in scaling
63
-
64
- #### **Performance & Efficiency**
65
- - **Mixed Dense + MoE Layers** (every 2nd layer is MoE) for optimal parameter utilization
66
- - Batch-based expert processing (no iterative loops)
67
- - Automatic SDPA backend optimization (Flash Attention when available)
68
- - Gradient accumulation & mixed precision training support
69
-
70
- #### 🖥️ **Cross-Platform Compatibility**
71
- - Pure PyTorch implementation without external kernels
72
- - Runs on **Windows, Linux, macOS**
73
- - No CUDA-only dependencies (Liger, Flash Attention libraries)
74
- - `pip install transformers torch` is sufficient for setup
75
-
76
- #### 📊 **Monitoring & Debugging**
77
- - TensorBoard integration for training metrics
78
- - Aux loss & router z-loss tracking
79
- - Sample generation callbacks during training
80
- - Expert load distribution monitoring
81
-
82
- ## 📊 Training Details
83
-
84
- ### Dataset (v8 OPUS Mix - German)
85
-
86
- - **Clean German Wikipedia:** ~11 GB (encyclopedic knowledge)
87
- - **OpenSubtitles (German):** Dialog corpus (natural language)
88
- - **Belletristik:** German literature corpus (style & creativity)
89
- - **Total Size:** ~17.4 GB
90
- - **Quality:** Deduplicated, SEO spam filtered
91
-
92
- > **Adapting to other languages:** Replace the dataset with your target language corpus. The architecture supports any tokenizer and language.
93
-
94
- ### Pre-Training Results
95
-
96
- - **Training Progress:** 300,000 / 300,000 steps
97
- - **Training Loss:** 12.0 → 2.55 (79% reduction)
98
- - **Validation Loss:** 4.58 → 2.40 (48% reduction)
99
- - **Final Perplexity:** **11.0** (exp(2.40))
100
- - **Total Training Time:** ~120 hours (RTX 4090)
101
- - **Hardware:** Single consumer GPU (24GB VRAM)
102
-
103
- ### Configuration
104
-
105
- ```python
106
- # Architecture
107
- n_layer = 12 # Transformer blocks
108
- n_embd = 768 # Hidden dimension
109
- n_head = 12 # Attention heads
110
- n_experts = 32 # Experts per MoE layer
111
- n_experts_active = 2 # Top-k routing
112
- moe_layer_frequency = 2 # Every 2nd layer is MoE
113
-
114
- # Training
115
- batch_size = 32
116
- gradient_accumulation_steps = 4
117
- max_lr = 3e-4
118
- capacity_factor = 1.25 # Expert capacity
119
- aux_loss_alpha = 0.01 # Load balance loss weight
120
- router_z_loss_alpha = 0.001 # Router z-loss weight
121
- ```
122
-
123
- ## 🚀 Usage
124
-
125
- ### Installation
126
-
127
- ```bash
128
- # Create conda environment
129
- conda create -n german_moe python=3.10
130
- conda activate german_moe
131
-
132
- # Install dependencies
133
- pip install -r requirements.txt
134
- ```
135
-
136
- ### Start / Resume Training
137
-
138
- The training script automatically detects existing checkpoints and resumes training:
139
-
140
- ```bash
141
- python train_moe_v8_clean.py
142
- ```
143
-
144
- **Key Features:**
145
- - Automatic checkpoint recovery
146
- - Mixed precision training (FP16/BF16)
147
- - Gradient accumulation
148
- - Sample generation during training
149
- - TensorBoard logging
150
-
151
- ### Inference / Text Generation
152
-
153
- ```bash
154
- python inference.py
155
- ```
156
-
157
- **Example Usage:**
158
-
159
- ```python
160
- from transformers import AutoTokenizer, AutoModelForCausalLM
161
-
162
- # Load model
163
- model = AutoModelForCausalLM.from_pretrained("./moe_final_v8_clean")
164
- tokenizer = AutoTokenizer.from_pretrained("./moe_final_v8_clean")
165
-
166
- # Generate text
167
- prompt = "Die Hauptstadt von Deutschland ist"
168
- inputs = tokenizer(prompt, return_tensors="pt")
169
- outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
170
- print(tokenizer.decode(outputs[0]))
171
- ```
172
-
173
- ### Monitoring
174
-
175
- ```bash
176
- # Start TensorBoard
177
- tensorboard --logdir=./logs_v8_clean
178
-
179
- # or on Windows:
180
- start_tensorboard.bat
181
-
182
- # Watch generated samples
183
- tail -f samples_v8_clean/generation_log.txt # Linux/Mac
184
- Get-Content samples_v8_clean/generation_log.txt -Wait # Windows PowerShell
185
-
186
- # Check GPU utilization
187
- nvidia-smi -l 1
188
- ```
189
-
190
- ## 📁 Project Structure
191
-
192
- ```
193
- german-moe-gpt-v8/
194
- ├── moe_model.py # Main model definition
195
- ├── moe_layers.py # MoE layer & router
196
- ├── moe_config.py # Configuration (HF-compatible)
197
- ├── moe_trainer.py # Custom trainer
198
- ├── train_moe_v8_clean.py # Training script
199
- ├── inference.py # Inference script
200
- ├── sample_generation_callback.py # Training callback
201
- ├── moe_checkpoints_v8_clean/ # Training checkpoints
202
- ├── moe_final_v8_clean/ # Final models
203
- ├── logs_v8_clean/ # TensorBoard logs
204
- └── samples_v8_clean/ # Generated text samples
205
- ```
206
-
207
- ## 🔬 Technical Details
208
-
209
- ### MoE Router Algorithm
210
-
211
- The router uses a **Noisy Top-k Gating Mechanism**:
212
-
213
- 1. **Gate Computation:** `router_logits = W_gate @ hidden_states`
214
- 2. **Noise Injection (Training):** `router_logits += softplus(W_noise @ hidden_states) * ε`
215
- 3. **Top-k Selection:** Selects the k best experts per token
216
- 4. **Capacity Management:** Limits tokens per expert (prevents overload)
217
- 5. **Weighted Routing:** Tokens are routed to experts with weights
218
-
219
- ### Loss Functions
220
-
221
- **Total Loss:**
222
- ```
223
- L_total = L_ce + α * L_aux + β * L_z
224
- ```
225
-
226
- - **L_ce:** Cross-entropy language modeling loss
227
- - **L_aux:** Load balance loss (expert utilization)
228
- - **L_z:** Router z-loss (numerical stability)
229
- - **α = 0.01, β = 0.001:** Empirically optimized weights
230
-
231
- ### Memory Optimization
232
-
233
- - **Gradient Checkpointing:** Reduces VRAM usage by ~40%
234
- - **Mixed Precision (BF16):** 2x faster training
235
- - **Gradient Accumulation:** Simulates larger batch sizes
236
- - **Weight Tying:** LM head shares weights with token embeddings
237
-
238
- ## 📚 References
239
-
240
- This project implements techniques from the following research papers:
241
-
242
- - **ST-MoE:** [Zoph et al. 2022 - "Designing Effective Sparse Expert Models"](https://arxiv.org/abs/2202.08906)
243
- - **Switch Transformer:** [Fedus et al. 2022 - "Switch Transformers"](https://arxiv.org/abs/2101.03961)
244
- - **RoFormer:** [Su et al. 2021 - "RoFormer: Enhanced Transformer with Rotary Position Embedding"](https://arxiv.org/abs/2104.09864)
245
-
246
- ## 📄 License
247
-
248
- MIT
249
-
250
- ## 🙏 Acknowledgments
251
-
252
- - HuggingFace Transformers team for the excellent framework
253
- - PyTorch team for SDPA and optimized operations
254
- - nanoGPT/nanoMoE community for inspiration
 
1
+ # German MoE GPT v8 - OPUS EDITION
2
+
3
+ A research-grade language model with a state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (a single RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
4
+
5
+ > **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
6
+
7
+ ## Model Description
8
+
9
+ This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.
10
+
11
+ ### Key Features
12
+
13
+ - 🏗️ **Hybrid Dense + MoE Architecture:** Every 2nd layer uses MoE for efficiency
14
+ - 🔬 **Research-Backed:** Implements ST-MoE and Switch Transformer best practices
15
+ - ⚡ **Efficient:** Only ~33% of parameters active per token
16
+ - 🖥️ **Cross-Platform:** Pure PyTorch, runs on Windows/Linux/macOS
17
+ - 🤗 **HuggingFace Compatible:** Full integration with `transformers` library
18
+
19
+ ## Model Specifications
20
+
21
+ | Specification | Value |
22
+ |--------------|-------|
23
+ | Total Parameters | 149.6M |
24
+ | Active Parameters per Token | ~49.9M (~33%) |
25
+ | Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
26
+ | Context Length | 2048 tokens |
27
+ | Architecture | Hybrid Dense + MoE Transformer |
28
+ | Layers | 12 |
29
+ | Hidden Size | 768 |
30
+ | Attention Heads | 12 |
31
+ | Experts per MoE Layer | 32 |
32
+ | Active Experts (Top-k) | 2 |
33
+ | Position Embeddings | RoPE (Rotary Position Embeddings) |
34
+
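These specifications correspond to the configuration constants listed for the v8 training run elsewhere in this project (the capacity factor is raised to 2.0 at evaluation time):

```python
# v8 training configuration (architecture)
n_layer = 12                 # Transformer blocks
n_embd = 768                 # Hidden dimension
n_head = 12                  # Attention heads
n_experts = 32               # Experts per MoE layer
n_experts_active = 2         # Top-k routing
moe_layer_frequency = 2      # Every 2nd layer is MoE

# MoE regularization
capacity_factor = 1.25       # Expert capacity during training (2.0 at evaluation)
aux_loss_alpha = 0.01        # Load-balance loss weight
router_z_loss_alpha = 0.001  # Router z-loss weight
```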
35
+ ## Training Data
36
+
37
+ The model was trained on a 17.4 GB curated German corpus consisting of:
38
+
39
+ - **Clean German Wikipedia** (~11 GB): Encyclopedic knowledge
40
+ - **OpenSubtitles (German)**: Natural dialog and conversational language
41
+ - **Belletristik** (belles-lettres / fiction): German literature for style and creativity
42
+
43
+ **Data Quality:** Deduplicated and SEO spam filtered for high-quality training signal.
44
+
45
+ > **Adapting to other languages:** The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.
46
+
47
+ ## Training Details
48
+
49
+ ### Training Hyperparameters
50
+
51
+ - **Steps:** 300,000
52
+ - **Batch Size:** 32 (with 4-step gradient accumulation; effective batch of 128 sequences)
53
+ - **Learning Rate:** 3e-4 (max)
54
+ - **Hardware:** Single RTX 4090 (24GB VRAM)
55
+ - **Training Time:** ~120 hours
56
+ - **Precision:** Mixed (BF16)
57
+
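As a back-of-the-envelope token budget, assuming the 300,000 steps are optimizer steps and each accumulates 4 micro-batches of 32 sequences × 2048 tokens (these assumptions follow the v8 configuration values above; the figure is an estimate, not an official statistic):

```python
# Rough token-budget estimate under the assumptions stated above
optimizer_steps = 300_000
micro_batch = 32             # sequences per micro-step
grad_accum = 4               # micro-steps per optimizer step
seq_len = 2048               # tokens per sequence

tokens_per_step = micro_batch * grad_accum * seq_len   # 262,144 tokens
total_tokens = optimizer_steps * tokens_per_step       # 78,643,200,000

print(f"~{total_tokens / 1e9:.1f}B tokens processed")  # ~78.6B tokens
```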
58
+ ### Results
59
+
60
+ | Metric | Initial | Final | Improvement |
61
+ |--------|---------|-------|-------------|
62
+ | Training Loss | 12.0 | 2.55 | 79% ↓ |
63
+ | Validation Loss | 4.58 | 2.40 | 48% ↓ |
64
+ | Perplexity | - | 11.0 | - |
65
+
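Perplexity here is simply the exponential of the final validation cross-entropy, so the reported value can be reproduced in one line:

```python
import math

val_loss = 2.40                      # final validation loss (nats per token)
print(round(math.exp(val_loss), 1))  # 11.0
```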
66
+ ## Usage
67
+
68
+ ### Installation
69
+
70
+ ```bash
71
+ pip install transformers torch
72
+ ```
73
+
74
+ ### Quick Start
75
+
76
+ ```python
77
+ from transformers import AutoTokenizer, AutoModelForCausalLM
78
+
79
+ # Load model and tokenizer
80
+ model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
81
+ tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
82
+
83
+ # Generate text
84
+ prompt = "Die Hauptstadt von Deutschland ist"
85
+ inputs = tokenizer(prompt, return_tensors="pt")
86
+ outputs = model.generate(
87
+ **inputs,
88
+ max_new_tokens=100,
89
+ temperature=0.8,
90
+ top_k=50,
91
+ top_p=0.9,
92
+ do_sample=True
93
+ )
94
+
95
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
96
+ ```
97
+
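If a CUDA GPU is available, moving the model and inputs to it speeds up generation considerably. A minimal variant of the snippet above using standard `torch`/`transformers` calls (nothing here is specific to this repository):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```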
98
+ ### Advanced Usage
99
+
100
+ ```python
101
+ # Generate with custom parameters
102
+ outputs = model.generate(
103
+ **inputs,
104
+ max_new_tokens=200,
105
+ temperature=0.7, # Lower = more deterministic
106
+ top_k=40, # Top-k sampling
107
+ top_p=0.95, # Nucleus sampling
108
+ repetition_penalty=1.1, # Reduce repetition
109
+ do_sample=True
110
+ )
111
+ ```
112
+
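For interactive use, tokens can also be streamed to the console as they are produced via the generic `TextStreamer` utility from `transformers` (a library feature, not something provided by this model):

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Prints tokens to stdout as soon as they are generated
model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    streamer=streamer,
)
```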
113
+ ## Technical Architecture
114
+
115
+ ### MoE Layer Design
116
+
117
+ The model uses a **Noisy Top-k Router** with the following components:
118
+
119
+ 1. **Gate Computation:** Learned routing weights per expert
120
+ 2. **Noise Injection:** Adds controlled noise during training for exploration
121
+ 3. **Top-k Selection:** Routes each token to the 2 best experts
122
+ 4. **Capacity Management:** Prevents expert overload with dynamic capacity limits
123
+ 5. **Load Balancing:** Auxiliary loss ensures uniform expert utilization
124
+
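The sketch below illustrates steps 1–3 of this routing scheme in plain PyTorch. It is a simplified illustration only: the names are hypothetical, and capacity management and the auxiliary losses (steps 4–5) are omitted; the actual implementation lives in `moe_layers.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Illustrative noisy top-k gate (simplified, not the project's implementation)."""

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)   # gate computation
        self.w_noise = nn.Linear(d_model, n_experts, bias=False)  # learned noise scale

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model); router math is typically kept in fp32 for stability
        logits = self.w_gate(x.float())
        if self.training:
            # Noise injection: Gaussian noise scaled by a learned, positive magnitude
            noise_std = F.softplus(self.w_noise(x.float()))
            logits = logits + torch.randn_like(logits) * noise_std

        # Top-k selection: keep the k best experts per token, renormalize their weights
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        topk_weights = topk_logits.softmax(dim=-1)       # (num_tokens, k)
        return topk_idx, topk_weights, logits            # full logits reused for aux losses


router = NoisyTopKRouter(d_model=768, n_experts=32, k=2)
expert_idx, expert_weights, router_logits = router(torch.randn(16, 768))
```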
125
+ ### Loss Functions
126
+
127
+ The training loss combines three components:
128
+
129
+ ```
130
+ L_total = L_ce + α * L_aux + β * L_z
131
+ ```
132
+
133
+ - **L_ce:** Cross-entropy language modeling loss
134
+ - **L_aux:** Load balance loss (α = 0.01) for uniform expert utilization
135
+ - **L_z:** Router z-loss (β = 0.001) for numerical stability
136
+
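A compact sketch of how these three terms can be combined, using the Switch Transformer form of the load-balance loss and the ST-MoE router z-loss (an illustration under those formulations, not the project's exact training code):

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_probs, expert_idx, n_experts):
    """Switch Transformer: N * sum_i(fraction_of_tokens_i * mean_router_prob_i)."""
    assignments = F.one_hot(expert_idx, n_experts).float()   # (tokens, k, n_experts)
    tokens_per_expert = assignments.sum(dim=1).mean(dim=0)   # f_i
    mean_probs = router_probs.mean(dim=0)                    # P_i
    return n_experts * torch.sum(tokens_per_expert * mean_probs)

def router_z_loss(router_logits):
    """ST-MoE: mean squared log-sum-exp of the router logits (keeps logits small)."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

def total_loss(ce_loss, router_logits, expert_idx, n_experts, alpha=0.01, beta=0.001):
    probs = router_logits.softmax(dim=-1)
    return (ce_loss
            + alpha * load_balance_loss(probs, expert_idx, n_experts)
            + beta * router_z_loss(router_logits))
```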
137
+ ### Attention Mechanism
138
+
139
+ - **RoPE (Rotary Position Embeddings)** for position encoding
140
+ - **PyTorch SDPA** with automatic backend selection (Flash Attention when available)
141
+ - **Causal masking** for autoregressive generation
142
+
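The attention path boils down to a single call to PyTorch's scaled dot-product attention; `is_causal=True` supplies the autoregressive mask, and the backend (Flash, memory-efficient, or math) is chosen automatically. A minimal, repository-independent example (RoPE application omitted):

```python
import torch
import torch.nn.functional as F

batch, n_head, seq_len, head_dim = 1, 12, 128, 64   # 12 heads * 64 = 768 hidden dim
q = torch.randn(batch, n_head, seq_len, head_dim)
k = torch.randn(batch, n_head, seq_len, head_dim)
v = torch.randn(batch, n_head, seq_len, head_dim)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 12, 128, 64])
```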
143
+ ### Optimizations
144
+
145
+ - **Gradient Checkpointing:** ~40% VRAM reduction
146
+ - **Mixed Precision (BF16):** 2x faster training
147
+ - **Weight Tying:** LM head shares embeddings
148
+ - **Batch Expert Processing:** Parallel computation for all experts
149
+
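Because the model inherits from `PreTrainedModel`, the first two optimizations can be switched on with standard HuggingFace/PyTorch calls. The sketch below continues from the Quick Start snippet (model and tokenizer already loaded) and assumes the model follows the usual causal-LM convention of returning a loss when `labels` are passed:

```python
import torch

model.gradient_checkpointing_enable()          # recompute activations: less VRAM, more compute
model.train().to("cuda")

batch = tokenizer("Ein kurzer Beispielsatz.", return_tensors="pt").to("cuda")
batch["labels"] = batch["input_ids"].clone()   # causal LM: labels mirror the inputs

# BF16 autocast for the forward pass (Ampere+ GPUs)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss

loss.backward()
```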
150
+ ## Limitations and Biases
151
+
152
+ - **Language:** Primarily trained on German text
153
+ - **Domain:** General domain (Wikipedia, literature, subtitles)
154
+ - **Biases:** May reflect biases present in training data
155
+ - **Context:** Limited to 2048 tokens
156
+ - **Compute:** Requires GPU for efficient inference
157
+
158
+ ## Ethical Considerations
159
+
160
+ Like any language model, this model can generate text that may be:
161
+ - Factually incorrect
162
+ - Biased or stereotypical
163
+ - Inappropriate or offensive
164
+
165
+ Users should:
166
+ - Verify generated content for factual accuracy
167
+ - Be aware of potential biases
168
+ - Use appropriate content filtering for production applications
169
+
170
+ ## Citation
171
+
172
+ If you use this model in your research, please cite:
173
+
174
+ ```bibtex
175
+ @misc{german-moe-gpt-v8,
176
+ title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
177
+ author={arnomatic},
178
+ year={2025},
179
+ howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
180
+ }
181
+ ```
182
+
183
+ ## References
184
+
185
+ This implementation is based on:
186
+
187
+ - **ST-MoE:** Zoph et al. (2022) - [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906)
188
+ - **Switch Transformer:** Fedus et al. (2022) - [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
189
+ - **RoFormer:** Su et al. (2021) - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
190
+
191
+ ## License
192
+
193
+ MIT License - See LICENSE file for details
194
+
195
+ ## Acknowledgments
196
+
197
+ - HuggingFace Transformers team for the excellent framework
198
+ - PyTorch team for SDPA and optimized operations
199
+ - nanoGPT/nanoMoE community for inspiration
200
+
201
+ ## Model Card Contact
202
+
203
+ For questions or feedback, please open an issue in the [GitHub repository](https://github.com/accemlcc/german-moe-gpt-v8).