Vicens committed on
Commit 00dbee7 · verified · 1 Parent(s): 036f667

Update README.md

Files changed (1)
  1. README.md +1 -7
README.md CHANGED
@@ -31,12 +31,6 @@ Engineered by [SCALAI](https://scalai.es), this model was surgically distilled f
  * **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
  * **Languages:** English, Spanish
 
- ## 🔬 The Innovation: Activation-Guided Sparsity & Router Healing
- Standard magnitude-based pruning often lobotomizes minority capabilities like secondary languages or strict syntax. To build ScaLite-60B, we pioneered a behavioral approach:
-
- 1. **Activation-Guided Sparsity:** We injected forward hooks into the original 120B model and passed a bilingual calibration dataset of complex math and code. We tracked actual expert utilization and permanently severed the 64 experts that represented the "encyclopedic long tail," preserving only the structural and logical specialists.
- 2. **Cross-Domain Router Healing:** Pruning 50% of the experts causes "Router Trauma" (probability misalignment). Instead of retraining on math (which risks data leakage), we froze the surviving experts and fine-tuned the router for 3,000 steps *exclusively* on Python code (`CodeAlpaca_20K`). This taught the router structural discipline, which miraculously generalized to mathematical reasoning.
-
  ## 📊 Benchmark Performance
  ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Incredibly, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.
 
@@ -49,7 +43,7 @@ ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation.
  | History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |
 
  ### GSM8K (Math Reasoning)
- * **Pre-Healing (Traumatized):** 17.59%
+ * **Pre-Healing (Pruned):** 17.59%
  * **Post-Healing (Cross-Domain Code):** **61.03%** *(Zero math data leakage)*
 
  ## 🎯 Intended Use & Limitations
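
For readers of this commit, the "Activation-Guided Sparsity" step described in the removed README section can be illustrated with a minimal sketch. It assumes the routing decisions have already been recorded (for example by forward hooks) as per-token tuples of expert indices. The function names, the toy trace, and the 8-expert, keep-4 sizes are hypothetical and are not taken from the ScaLite code; this shows only the general idea of ranking experts by observed usage, not the authors' implementation.

```python
from collections import Counter


def count_expert_usage(routing_trace):
    """Count how often each expert index appears in a routing trace.

    routing_trace: iterable of per-token tuples of selected expert indices.
    """
    usage = Counter()
    for selected in routing_trace:
        usage.update(selected)
    return usage


def select_experts_to_keep(usage, num_experts, num_keep):
    """Return the sorted indices of the num_keep most-used experts."""
    # Experts that were never selected count as zero; ties break by index.
    ranked = sorted(range(num_experts), key=lambda e: (-usage.get(e, 0), e))
    return sorted(ranked[:num_keep])


def build_index_remap(kept):
    """Map each surviving expert's old index to its new, compact index."""
    return {old: new for new, old in enumerate(kept)}


# Hypothetical toy trace: 8 experts, top-2 routing on 8 calibration tokens.
trace = [(1, 2), (1, 5), (2, 6), (1, 6), (5, 6), (2, 0), (1, 2), (5, 3)]

usage = count_expert_usage(trace)
kept = select_experts_to_keep(usage, num_experts=8, num_keep=4)
remap = build_index_remap(kept)

print("kept experts:", kept)
print("old -> new index:", remap)
```

After pruning, the router's output dimension would shrink to the kept experts, which is why the removed text describes a follow-up router fine-tuning stage with the surviving experts frozen.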