# MiniMax-M2.7-REAP-172B-A10B-NVFP4
This is my second attempt at this model, which I hope to use as a local coder for long-range tasks on an NVIDIA Thor dev kit. I tried to find the best reference sources for calibrated weights and expert pruning, and I see a significant improvement in chat and coding over the first take. One remaining problem, which may be due to REAP (the NVFP4 source appears high quality and carries proper KV-cache scales), is the model getting confused with pathnames: changing case, dropping path components, or inserting spaces. I added the following instructions to Kilo Code, which fixed these issues:

> You are an AI coding assistant. Help user write high quality, modular, clean code. Pay special attention to pathnames - make sure you preserve case, do not drop path components or insert spaces and other extra characters.

I will now try to use the model for real-life tasks and monitor for issues that are (or are not) correctable by prompts. If you prefer a slightly smaller un-REAPed model that is less likely to have regressions, you can try my catplusplus/Qwen3.5-122B-A10B-heretic-v2-NVFP4.
A REAP-pruned variant of MiniMax-M2.7 with NVFP4-quantized expert weights and FP8 KV-cache scales. The original 256-expert-per-layer MoE has been reduced to 192 experts per layer (25% compression) using the same pruning mask as saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10.
- NVFP4 weights source: NinjaBoffin/MiniMax-M2.7-NVFP4
- Pruning reference: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10
- Architecture: 62 transformer layers, 192 experts/layer, top-8 routing, hidden size 3072, 48 attention heads, 8 KV heads
- Total parameters: ~172 B (A10B, ~10 B activated per token)
## Quantization Details

| Component | Format | Notes |
|---|---|---|
| Expert FFN weights (w1/w2/w3) | NVFP4 | 4-bit float, group_size=16, per-group + global scales |
| Attention projections (q/k/v/o_proj) | bfloat16 | Excluded from quantization |
| Gate weights (block_sparse_moe.gate) | bfloat16 | Excluded from quantization |
| KV cache | FP8 | Per-layer k_scale/v_scale tensors |
Each expert stores three scale tensors per weight matrix: `input_scale` (per-tensor activation scale), `weight_scale` (per-group weight scale), and `weight_scale_2` (global weight scale). The FP8 KV cache uses `k_proj.k_scale` and `v_proj.v_scale`, which vLLM remaps to its internal attention scale slots.
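For intuition, here is a rough sketch of how the per-group and global weight scales combine during dequantization. This is an assumption for illustration only, not the vLLM NVFP4 kernel; it takes already-decoded FP4 values as float32 input.

```python
# Rough dequantization sketch (assumption, not the vLLM NVFP4 kernel):
# each group of 16 FP4 values shares one weight_scale entry, and the whole
# tensor shares a single global weight_scale_2.
import numpy as np

def dequantize_nvfp4(w_fp4: np.ndarray, weight_scale: np.ndarray,
                     weight_scale_2: float, group_size: int = 16) -> np.ndarray:
    """w_fp4: decoded FP4 values as float32, shape (rows, cols), cols % group_size == 0.
    weight_scale: per-group scales, shape (rows, cols // group_size)."""
    rows, cols = w_fp4.shape
    groups = w_fp4.reshape(rows, cols // group_size, group_size)
    scales = weight_scale[:, :, None] * weight_scale_2  # group scale * global scale
    return (groups * scales).reshape(rows, cols)
```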
## How This Was Made

### Pruning Mask Extraction
The REAP expert pruning removes 64 out of 256 experts per layer (different experts per layer — not a uniform pattern). We extracted the pruning mask by comparing router matrices between the original 256-expert NVFP4 model and the already-pruned 192-expert reference model:
1. Router comparison: For each of the 62 layers, the 256-row original gate weight matrix is matched against the 192-row pruned gate weight matrix using the Hungarian algorithm on cosine distance.
2. Mask generation: Experts in the original that don't match any row in the pruned model are marked as deleted (64 per layer). The per-layer mask is saved to `extras/deleted_experts.json`.
3. Expert deletion: The identified experts are removed from all NVFP4 weight and scale tensors, gate weights are row-sliced, and remaining experts are renumbered 0–191. Asymmetric KV-cache zero-point tensors (`k_bias`/`v_bias`) are stripped since vLLM does not use them; the symmetric `k_scale`/`v_scale` tensors are preserved.
The full script lives at `extras/delete_experts.py` and can also apply a saved mask to any compatible model; a rough sketch of the router-comparison step follows below.
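This is a minimal sketch of the per-layer matching, assuming the gate weights are available as NumPy arrays; the actual logic lives in `extras/delete_experts.py` and may differ in detail.

```python
# Sketch of the per-layer Hungarian matching on cosine distance
# (illustrative only, not the exact code in extras/delete_experts.py).
import numpy as np
from scipy.optimize import linear_sum_assignment

def deleted_experts_for_layer(orig_gate: np.ndarray, pruned_gate: np.ndarray) -> list[int]:
    """orig_gate: (256, hidden) router rows; pruned_gate: (192, hidden) router rows."""
    a = orig_gate / np.linalg.norm(orig_gate, axis=1, keepdims=True)
    b = pruned_gate / np.linalg.norm(pruned_gate, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine distance, shape (256, 192)
    rows, _ = linear_sum_assignment(cost)     # Hungarian: each pruned row matched to one original
    kept = set(rows.tolist())                 # original experts that survive
    return [i for i in range(orig_gate.shape[0]) if i not in kept]  # 64 deleted experts
```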
## Usage with vLLM

This model requires a vLLM build with NVFP4 and FP8 KV-cache support (Blackwell / GB10 or later) and the `minimax_m2` model backend.
```bash
vllm serve /path/to/MiniMax-M2.7-REAP-172B-A10B-NVFP4 \
  --served-model-name Nikola \
  --port 9000 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_optthink \
  --reasoning-parser-plugin extras/minimax_m2_optthink_reasoning_parser.py \
  --chat-template extras/chat_template.jinja \
  --enable-prefix-caching \
  --attention-backend FLASHINFER \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4
```
See `extras/inference_minimax.sh` for the full launch script used on DGX Spark / GB10.
## extras/ Directory

### `inference_minimax.sh`
vLLM server launcher for DGX Spark / GB10. It runs inside a custom `unglitched_vllm` container via dockless. By default it loads the parent model directory (where this script lives); pass a path to override:
```bash
./extras/inference_minimax.sh                  # serves ../MiniMax-M2.7-REAP-172B-A10B-NVFP4
./extras/inference_minimax.sh /path/to/model   # custom path
```
Notable flags used:
- `VLLM_USE_FLASHINFER_MOE_FP4=0` — disables the FlashInfer MoE FP4 kernel (uses the fallback NVFP4 GEMM path, which is more stable on GB10)
- `--async-scheduling --enable-chunked-prefill` — latency-oriented scheduling
- `--cudagraph-capture-sizes 1 2 4` — captures CUDA graphs for typical batch sizes
### `chat_template.jinja`

Custom Jinja2 chat template that supports optional chain-of-thought reasoning. Pass `enable_thinking=False` in `chat_template_kwargs` to suppress `<think>` blocks:
```python
# OpenAI client — disable thinking
from openai import OpenAI

# Points at the vLLM server started above; vLLM ignores the API key locally.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

client.chat.completions.create(
    model="Nikola",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
When `enable_thinking=True` (the default), the model produces a `<think>…</think>` block before its answer, which the reasoning parser extracts into the `reasoning_content` field of the response.
### `minimax_m2_optthink_reasoning_parser.py`
vLLM reasoning parser registered as `minimax_m2_optthink`. It handles MiniMax M2's convention where the model emits only a closing `</think>` token (no opening tag): all content before `</think>` is treated as reasoning and placed in `reasoning_content`; everything after is the assistant reply. When `enable_thinking=False` was passed in `chat_template_kwargs`, the parser skips extraction entirely, so no thinking tokens appear in the output.
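To make the convention concrete, here is a toy illustration of the split; `split_reasoning` is a hypothetical helper, not the shipped parser, which additionally handles streaming and vLLM's plugin interface.

```python
# Toy illustration of the closing-tag-only convention (hypothetical helper,
# not the code in extras/minimax_m2_optthink_reasoning_parser.py).
def split_reasoning(text: str, enable_thinking: bool = True) -> tuple[str | None, str]:
    """Split a completed generation into (reasoning_content, content)."""
    if not enable_thinking or "</think>" not in text:
        return None, text  # no extraction when thinking is disabled or the tag is absent
    reasoning, _, answer = text.partition("</think>")
    return reasoning.strip(), answer.strip()

print(split_reasoning("Let me check the path first.</think>The file is at src/main.rs"))
```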
Load it at runtime via `--reasoning-parser-plugin`; no installation needed:

```bash
--reasoning-parser minimax_m2_optthink \
--reasoning-parser-plugin /path/to/extras/minimax_m2_optthink_reasoning_parser.py
```
### `deleted_experts.json`

Pre-computed pruning mask: a JSON mapping from each of the 62 layer indices to the list of 64 expert indices (0–255) deleted from that layer. It can be passed directly to `delete_experts.py` via `--deleted-experts-file` to skip the Hungarian-matching step.
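A quick sanity check of the mask shape, assuming the JSON maps layer index to a list of deleted expert indices (the exact key type is an assumption):

```python
import json

# Sanity-check the pruning mask: 62 layers, 64 deleted experts each, indices in 0-255.
with open("extras/deleted_experts.json") as f:
    mask = json.load(f)

assert len(mask) == 62
for layer, deleted in mask.items():
    assert len(deleted) == 64
    assert all(0 <= e < 256 for e in deleted)
print("mask OK")
```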
### `delete_experts.py`

Utility script to apply expert deletion to a compatible model. It has two modes.

Find the pruning mask by comparison, then apply it:
```bash
python extras/delete_experts.py \
  /path/to/MiniMax-M2.7-NVFP4 \
  /path/to/output \
  --num-original-experts 256 \
  --num-retained-experts 192 \
  --compare-with /path/to/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 \
  --save-deleted-experts extras/deleted_experts.json
```
Apply a saved mask directly:
```bash
python extras/delete_experts.py \
  /path/to/MiniMax-M2.7-NVFP4 \
  /path/to/output \
  --num-original-experts 256 \
  --num-retained-experts 192 \
  --deleted-experts-file extras/deleted_experts.json
```
### `unglitched_vllm`

Custom vLLM build script and GPU memory swap utility for DGX Spark. `force_swap.cpp` implements a small helper that forces GPU pages to swap to system memory, useful for fitting the model on a single GB10 node.
## License
This model inherits the MiniMax M2.7 Non-Commercial License. See LICENSE for full terms.
## Citation
```bibtex
@misc{minimax-m2-7,
  title={MiniMax-M2.7},
  author={MiniMax},
  url={https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}

@misc{minimax-m2-7-reap-nvfp4,
  title={MiniMax-M2.7-REAP-172B-A10B-NVFP4},
  author={Oleg K.},
  note={NVFP4 weights from NinjaBoffin/MiniMax-M2.7-NVFP4; pruning mask
        from saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10},
  year={2026}
}
```