---
tags:
- moe
- minimax
- bfloat16
- sglang
- mlx
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---
![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)

# VibeStudio/MiniMax-M2-THRIFT-55-v1

**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**

A lean, efficiency-first variant of MiniMax-M2 designed to minimize **latency and VRAM usage** while maximizing **throughput** for local, on-prem, and edge deployments.

## TL;DR

* **What:** ~55% expert-pruned MoE with staged pruning + knowledge distillation.
* **Why:** Push the efficiency frontier for compact, responsive deployments.
* **Now:** Ready for experimentation with solid coverage across core evals and more on the way.

---

## Why it’s useful

* **Lower latency:** Fast, responsive interactions for interactive apps and tools.
* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
* **Higher throughput:** Serve more concurrent users on the same hardware.
* **Deployment-friendly:** Smooth drop-in via SGLang with an OpenAI-compatible API (see the sketch after this list).
* **Adaptable:** Plays well with light fine-tuning to match domain and style.
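
Because the served API is OpenAI-compatible, a stock client can talk to a local SGLang server. A minimal sketch, assuming a server launched with something like `python -m sglang.launch_server --model-path VibeStudio/MiniMax-M2-THRIFT-55-v1 --port 30000` (flags, port, and prompt are illustrative):

```python
# Minimal sketch: querying a locally served MiniMax-M2-THRIFT model through
# SGLang's OpenAI-compatible endpoint. Base URL and prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="VibeStudio/MiniMax-M2-THRIFT-55-v1",
    messages=[{"role": "user", "content": "In one sentence, what does expert pruning do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```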

## Intended use

* Local/air-gapped assistants and dev tools
* Cost-sensitive batch workloads and real-time services
* Edge and on-prem deployments prioritizing efficiency

---

## How Our Approach Works

> **Active research in progress** — we continue to iterate and expand ablations.

* **Teacher–student setup:** Start with **MiniMax-M2** as teacher and a copy as student.
* **Gradual expert pruning:** Remove **≈5% of experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out (LOEO)** check to retain rare-but-important experts (see the pruning sketch after this list).
* **Distill after each prune:** Retrain the student to imitate the teacher on

  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts** (a sketch of the combined loss follows).
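
A hypothetical sketch of one pruning stage. It assumes per-expert importance is estimated from mean router probability on a calibration batch, and that `loeo_delta` holds a precomputed Leave-One-Expert-Out loss increase per expert; names, shapes, and thresholds are illustrative, not the released training code:

```python
# Hypothetical sketch of one expert-pruning stage (not the released code).
import torch

def expert_importance(router_probs: torch.Tensor) -> torch.Tensor:
    """Mean routing probability per expert over a calibration batch.

    router_probs: [num_tokens, num_experts] softmax outputs of the router.
    """
    return router_probs.mean(dim=0)  # [num_experts]

def prune_one_stage(importance: torch.Tensor,
                    loeo_delta: torch.Tensor,
                    frac: float = 0.05,
                    loeo_threshold: float = 0.01) -> torch.Tensor:
    """Return a boolean keep-mask dropping ~`frac` of the experts.

    loeo_delta[e]: increase in validation loss when expert e alone is
    removed (Leave-One-Expert-Out). Experts whose removal hurts more
    than `loeo_threshold` are retained even if their importance is low.
    """
    num_experts = importance.numel()
    num_prune = max(1, int(frac * num_experts))
    order = torch.argsort(importance)  # least important first
    keep = torch.ones(num_experts, dtype=torch.bool)
    pruned = 0
    for e in order.tolist():
        if pruned == num_prune:
            break
        if loeo_delta[e] > loeo_threshold:
            continue  # rare-but-important expert: keep it
        keep[e] = False
        pruned += 1
    return keep
```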
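A minimal sketch of the combined distillation objective, assuming access to teacher/student logits, matched hidden states, and router logits indexed down to the surviving experts; the loss weights and temperature are illustrative assumptions:

```python
# Hypothetical sketch of the per-batch distillation loss (weights illustrative).
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits,
                 student_hidden, teacher_hidden,
                 student_router_logits, teacher_router_logits,
                 keep_mask,  # bool tensor [num_experts], surviving experts
                 temperature: float = 2.0,
                 w_out: float = 1.0, w_hid: float = 0.5, w_router: float = 0.5):
    t = temperature

    # 1) Output distillation: KL between token probability distributions.
    kl_out = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # 2) Hidden-state matching (teacher activations are frozen targets).
    hid = F.mse_loss(student_hidden, teacher_hidden.detach())

    # 3) Router behavior over surviving experts only: renormalize the
    #    teacher's routing distribution on the kept experts, then match.
    s_router = F.log_softmax(student_router_logits[..., keep_mask], dim=-1)
    t_router = F.softmax(teacher_router_logits[..., keep_mask], dim=-1)
    kl_router = F.kl_div(s_router, t_router, reduction="batchmean")

    return w_out * kl_out + w_hid * hid + w_router * kl_router
```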

---

**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
https://github.com/latent-variable/minimax-agent-guide