vibestudio-HQ committed · Commit 99038dd · verified · 1 Parent(s): 6543e09

Update README.md

Files changed (1): README.md (+50 -4)

README.md CHANGED
@@ -5,12 +5,58 @@ tags:
  - bfloat16
  - sglang
  - gguf
- - mlx
  license: mit
  datasets:
  - nick007x/github-code-2025
  - tatsu-lab/alpaca
- base_model: VibeStudio/MiniMax-M2-THRIFT-55
- pipeline_tag: text-generation
- library_name: mlx
+ base_model:
+ - MiniMaxAI/MiniMax-M2
  ---

The remaining added lines (the new README body, lines 15-62 of the updated file) follow:
![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)

# VibeStudio/MiniMax-M2-THRIFT-55-v1

**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**

A lean, efficiency-first variant of MiniMax-M2 designed to cut **latency and VRAM usage** and raise **throughput** for local, on-prem, and edge deployments.

## TLDR

* **What:** ~55% expert-pruned MoE built with staged pruning + knowledge distillation.
* **Why:** Push the efficiency frontier for compact, responsive deployments.
* **Now:** Ready for experimentation, with solid coverage across core evals and more on the way.

---

## Why it’s useful

* **Lower latency:** Fast, responsive interactions for interactive apps and tools.
* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
* **Higher throughput:** Serve more concurrent users on the same hardware.
* **Deployment-friendly:** Drops in smoothly via SGLang with an OpenAI-compatible API (see the serving sketch after this list).
* **Adaptable:** Plays well with light fine-tuning to match domain and style (see the fine-tuning sketch after this list).
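
A minimal serving sketch for the drop-in path above. The repo id, launch flags, port, and sampling parameters are illustrative assumptions, not an official configuration:

```python
# Query an SGLang server through its OpenAI-compatible API.
# Assumes the server was started separately, for example:
#   python -m sglang.launch_server --model-path VibeStudio/MiniMax-M2-THRIFT-55 \
#       --trust-remote-code --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="VibeStudio/MiniMax-M2-THRIFT-55",  # assumed repo id; check the Hub
    messages=[{"role": "user", "content": "Write a function that checks whether a string is a palindrome."}],
    temperature=0.2,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```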
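
The light fine-tuning mentioned above could be done with a parameter-efficient method such as LoRA. The sketch below uses `transformers` + `peft` and is purely illustrative: the repo id, rank, and `target_modules` names are assumptions, so inspect the model's layers before reusing them.

```python
# Illustrative LoRA setup for light domain/style adaptation (not an official recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "VibeStudio/MiniMax-M2-THRIFT-55"  # assumed repo id; check the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Train with your usual Trainer / SFT loop on domain data from here.
```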

## Intended use

* Local/air-gapped assistants and dev tools
* Cost-sensitive batch jobs and real-time services
* Edge and on-prem deployments prioritizing efficiency

---

## How Our Approach Works

> **Active research in progress** — we continue to iterate on and expand our ablations.

* **Teacher–student setup:** Start with **MiniMax-M2** as the teacher and a copy of it as the student.
* **Gradual expert pruning:** Remove **≈5% of the experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores plus a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts (see the pruning sketch after this list).
* **Distill after each prune:** Retrain the student to imitate the teacher on the following (a loss sketch also follows this list):
  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts**.
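
A rough sketch of a single pruning stage under the description above; the helper functions (`importance`, `evaluate_loss`, `drop_experts`) and the tolerance are hypothetical stand-ins, not the released pipeline:

```python
# One pruning stage: rank experts by an importance score, then apply a
# lightweight Leave-One-Expert-Out (LOEO) check before actually removing them.
from typing import Callable, List


def prune_one_stage(
    model,
    expert_ids: List[int],
    importance: Callable[[object, int], float],     # e.g. routing frequency * mean gate weight
    evaluate_loss: Callable[[object], float],       # loss on a small calibration set
    drop_experts: Callable[[object, List[int]], object],
    prune_count: int,                               # ~5% of the original expert count per stage
    loeo_tolerance: float = 0.02,
):
    """Remove up to `prune_count` low-importance experts, keeping any expert whose
    individual removal raises calibration loss by more than `loeo_tolerance`."""
    base_loss = evaluate_loss(model)
    candidates = sorted(expert_ids, key=lambda e: importance(model, e))  # least important first

    to_remove: List[int] = []
    for expert in candidates:
        if len(to_remove) == prune_count:
            break
        # LOEO check: drop just this expert and re-measure calibration loss.
        if evaluate_loss(drop_experts(model, [expert])) - base_loss <= loeo_tolerance:
            to_remove.append(expert)
        # Otherwise the expert is rare-but-important and survives.

    return drop_experts(model, to_remove), to_remove
```

Repeating this for ~11 stages at ~5% each, with a distillation pass after every stage, yields the ~55% total reduction described above.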
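
The per-stage distillation objective could combine the three signals listed above roughly as follows; the temperature, loss weights, tensor shapes, and the way teacher router probabilities are restricted to surviving experts are illustrative assumptions:

```python
# Combined distillation loss: output KL + hidden-state MSE + router KL over kept experts.
import torch
import torch.nn.functional as F


def distill_loss(
    student_logits: torch.Tensor,         # [batch, seq, vocab]
    teacher_logits: torch.Tensor,         # [batch, seq, vocab]
    student_hidden: torch.Tensor,         # [batch, seq, d_model]
    teacher_hidden: torch.Tensor,         # [batch, seq, d_model]
    student_router_logits: torch.Tensor,  # [batch, seq, n_kept_experts]
    teacher_router_logits: torch.Tensor,  # [batch, seq, n_total_experts]
    kept_expert_ids: torch.Tensor,        # teacher indices of the surviving experts
    temperature: float = 2.0,
    w_out: float = 1.0,
    w_hidden: float = 0.5,
    w_router: float = 0.5,
) -> torch.Tensor:
    t = temperature

    # 1) Outputs: KL between softened teacher and student token distributions.
    kl_out = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # 2) Hidden states: simple MSE between matching layers.
    mse_hidden = F.mse_loss(student_hidden, teacher_hidden)

    # 3) Router behavior: compare routing distributions over surviving experts only,
    #    renormalizing the teacher's probabilities on that subset.
    teacher_probs = F.softmax(teacher_router_logits, dim=-1)[..., kept_expert_ids]
    teacher_probs = teacher_probs / teacher_probs.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    kl_router = F.kl_div(
        F.log_softmax(student_router_logits, dim=-1),
        teacher_probs,
        reduction="batchmean",
    )

    return w_out * kl_out + w_hidden * mse_hidden + w_router * kl_router
```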

---

**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
https://github.com/latent-variable/minimax-agent-guide