---
tags:
- moe
- minimax
- bfloat16
- sglang
- mlx
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---

# VibeStudio/MiniMax-M2-THRIFT-55-v1
**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**
A lean, efficiency-first variant of MiniMax-M2 designed to **cut latency and VRAM use while raising throughput** for local, on-prem, and edge deployments.
## TL;DR
* **What:** ~55% expert-pruned MoE with staged pruning + knowledge distillation.
* **Why:** Push the efficiency frontier for compact, responsive deployments.
* **Now:** Ready for experimentation; core evals are covered, with more in progress.
---
## Why it’s useful
* **Lower latency:** Fast, responsive interactions for interactive apps and tools.
* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
* **Higher throughput:** Serve more concurrent users on the same hardware.
* **Deployment-friendly:** Smooth drop-in via SGLang with OpenAI-compatible API.
* **Adaptable:** Plays well with light fine-tuning to match domain and style.
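As a rough serving sketch of the SGLang drop-in mentioned above (the flags and port are illustrative; check the SGLang docs for your version and hardware):

```shell
# Launch an OpenAI-compatible SGLang server (flags are illustrative).
python -m sglang.launch_server \
  --model-path VibeStudio/MiniMax-M2-THRIFT-55-v1 \
  --port 30000

# Query it through the standard OpenAI-style chat endpoint.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "VibeStudio/MiniMax-M2-THRIFT-55-v1",
       "messages": [{"role": "user", "content": "Hello"}]}'
```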
## Intended use
* Local/air-gapped assistants and dev tools
* Cost-sensitive batch and realtime services
* Edge and on-prem deployments prioritizing efficiency
---
## How Our Approach Works
> **Active research in progress** — we continue to iterate and expand ablations.
* **Teacher–student setup:** Start with **MiniMax-M2** as teacher and a copy as student.
* **Gradual expert pruning:** Remove **≈5% experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts.
* **Distill after each prune:** Retrain the student to imitate the teacher on:
  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts**.
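The staged prune-then-distill loop above can be sketched in miniature. This is a toy NumPy illustration, not the actual pipeline: `expert_importance`, the leave-one-expert-out `protected` set, and the toy router distributions are all assumptions for the sake of the example, and only the output-distribution part of the distillation loss is shown (hidden-state and router losses would be added similarly).

```python
import numpy as np

def expert_importance(router_probs):
    # router_probs: [tokens, experts] softmax gate outputs.
    # Importance here = mean gate mass routed to each expert (an assumption).
    return router_probs.mean(axis=0)

def prune_one_stage(importance, keep_mask, frac=0.05, protected=()):
    # Drop the lowest-importance ~5% of experts (fraction of the original
    # count), skipping any expert the leave-one-expert-out check flagged
    # as rare-but-important.
    alive = np.flatnonzero(keep_mask)
    n_drop = max(1, int(round(frac * keep_mask.size)))
    for e in alive[np.argsort(importance[alive])]:
        if n_drop == 0:
            break
        if e in protected:  # LOEO said removing e hurts too much
            continue
        keep_mask[e] = False
        n_drop -= 1
    return keep_mask

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def output_distill_loss(student_logits, teacher_logits):
    # Token-level KL(teacher || student) on output distributions.
    t_log = log_softmax(teacher_logits)
    s_log = log_softmax(student_logits)
    return (np.exp(t_log) * (t_log - s_log)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
E = 20                                          # toy expert count
probs = rng.dirichlet(np.ones(E), size=1000)    # toy router distributions
imp = expert_importance(probs)
mask = np.ones(E, dtype=bool)
for stage in range(11):                         # ~11 stages x ~5% ≈ 55%
    mask = prune_one_stage(imp, mask, frac=0.05)
    # ... distill student toward teacher here, minimizing output_distill_loss
print(mask.sum())  # 9 of 20 toy experts survive
```

In the real pipeline each stage would be followed by distillation before importance is re-measured; the toy loop only shows where that step slots in.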
---
**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
https://github.com/latent-variable/minimax-agent-guide