Update README.md
README.md
CHANGED
@@ -31,12 +31,6 @@ Engineered by [SCALAI](https://scalai.es), this model was surgically distilled f
 * **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
 * **Languages:** English, Spanish

-## 🔬 The Innovation: Activation-Guided Sparsity & Router Healing
-Standard magnitude-based pruning often lobotomizes minority capabilities like secondary languages or strict syntax. To build ScaLite-60B, we pioneered a behavioral approach:
-
-1. **Activation-Guided Sparsity:** We injected forward hooks into the original 120B model and passed a bilingual calibration dataset of complex math and code. We tracked actual expert utilization and permanently severed the 64 experts that represented the "encyclopedic long tail," preserving only the structural and logical specialists.
-2. **Cross-Domain Router Healing:** Pruning 50% of the experts causes "Router Trauma" (probability misalignment). Instead of retraining on math (which risks data leakage), we froze the surviving experts and fine-tuned the router for 3,000 steps *exclusively* on Python code (`CodeAlpaca_20K`). This taught the router structural discipline, which miraculously generalized to mathematical reasoning.
-
 ## 📊 Benchmark Performance
 ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Incredibly, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.

@@ -49,7 +43,7 @@ ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation.
 | History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |

 ### GSM8K (Math Reasoning)
-* **Pre-Healing (
+* **Pre-Healing (Pruned):** 17.59%
 * **Post-Healing (Cross-Domain Code):** **61.03%** *(Zero math data leakage)*

 ## 🎯 Intended Use & Limitations
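
The "Activation-Guided Sparsity" step described in the removed README section (track expert utilization on a calibration set, then drop the least-used half) can be illustrated with a small sketch. This is not the authors' code. It assumes the forward hooks have already produced a log of which expert ids the router chose for each token. The names `routing_log`, `count_expert_usage`, and `select_experts_to_keep` are hypothetical, and the toy model has 8 experts rather than the real model's expert count.

```python
from collections import Counter


def count_expert_usage(routing_log):
    """Count how often each expert id was selected across all tokens."""
    usage = Counter()
    for token_experts in routing_log:
        usage.update(token_experts)
    return usage


def select_experts_to_keep(usage, num_experts, keep):
    """Return the ids of the `keep` most-used experts, in ascending order.

    Ties are broken by the lower expert id so the result is deterministic.
    """
    ranked = sorted(range(num_experts), key=lambda e: (-usage.get(e, 0), e))
    return sorted(ranked[:keep])


# Hypothetical routing log: one entry per calibration token, listing the
# expert ids the router picked for that token (top-2 routing assumed here).
routing_log = [[0, 3], [0, 1], [3, 0], [1, 3], [0, 2], [3, 5]]

usage = count_expert_usage(routing_log)
kept = select_experts_to_keep(usage, num_experts=8, keep=4)  # prune 50%
pruned = [e for e in range(8) if e not in kept]

print("kept:", kept)
print("pruned:", pruned)
```

In a real pipeline the log would presumably come from hooks registered on each MoE router layer, and the selection would be made per layer. The sketch only shows the counting and ranking logic.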