VibeStudio
/

MiniMax-M2-THRIFT-55-MLX-4bit

Mixture of Experts

4-bit precision

Model card Files Files and versions

vibestudio-HQ commited on 15 days ago

Commit

99038dd

·

verified ·

1 Parent(s): 6543e09

Update README.md

Files changed (1) hide show

README.md +50 -4

README.md CHANGED Viewed

@@ -5,12 +5,58 @@ tags:
 - bfloat16
 - sglang
 - gguf
-- mlx
 license: mit
 datasets:
 - nick007x/github-code-2025
 - tatsu-lab/alpaca
-base_model: VibeStudio/MiniMax-M2-THRIFT-55
-pipeline_tag: text-generation
-library_name: mlx
 ---

 - bfloat16
 - sglang
 - gguf
 license: mit
 datasets:
 - nick007x/github-code-2025
 - tatsu-lab/alpaca
+base_model:
+- MiniMaxAI/MiniMax-M2
 ---
+![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)
+# VibeStudio/MiniMax-M2-THRIFT-55-v1
+**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**
+A lean, efficiency-first variant of MiniMax-M2 designed to maximize **latency, throughput, and VRAM savings** for local, on-prem, and edge deployments.
+## TLDR
+* **What:** ~55% expert-pruned MoE with staged pruning + knowledge distillation.
+* **Why:** Push the efficiency frontier for compact, responsive deployments.
+* **Now:** Ready for experimentation with solid coverage across core evals and more on the way.
+---
+## Why it’s useful
+* **Lower latency:** Fast, responsive interactions for interactive apps and tools.
+* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
+* **Higher throughput:** Serve more concurrent users on the same hardware.
+* **Deployment-friendly:** Smooth drop-in via SGLang with OpenAI-compatible API.
+* **Adaptable:** Plays well with light fine-tuning to match domain and style.
+## Intended use
+* Local/air-gapped assistants and dev tools
+* Cost-sensitive batches and realtime services
+* Edge and on-prem deployments prioritizing efficiency
+---
+## How Our Approach Works
+> **Active research in progress** — we continue to iterate and expand ablations.
+* **Teacher–student setup:** Start with **MiniMax-M2** as teacher and a copy as student.
+* **Gradual expert pruning:** Remove **≈5% experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts.
+* **Distill after each prune:** Retrain the student to imitate the teacher on
+  * **Outputs** (token probability distributions),
+  * **Hidden states**, and
+  * **Router behavior** over the **surviving experts**.
+---
+**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
+https://github.com/latent-variable/minimax-agent-guide