# MiniMax-M2.7-REAP-172B-A10B-NVFP4
This is my second attempt at this model, which I hope to use as a local coder for long-range tasks on an NVIDIA Thor dev kit. I tried to find the best reference sources for calibrated weights and expert pruning, and I see a significant improvement in chat and coding over the first take. One remaining problem, which may be due to REAP (the NVFP4 source appears high quality and carries proper KV-cache scales), is the model getting confused with pathnames: changing case, dropping path components, or inserting spaces. I added the following instructions to Kilo Code, which fixed these issues:

> You are an AI coding assistant. Help user write high quality, modular, clean code. Pay special attention to pathnames - make sure you preserve case, do not drop path components or insert spaces and other extra characters.

I will now try to use the model for real-life tasks and monitor for issues that are (or are not) correctable by prompts. If you prefer a slightly smaller un-REAPed model that is less likely to have regressions, you can try my catplusplus/Qwen3.5-122B-A10B-heretic-v2-NVFP4.
A REAP-pruned variant of MiniMax-M2.7 with NVFP4-quantized expert weights and FP8 KV-cache scales. The original 256-expert-per-layer MoE has been reduced to 192 experts per layer (25% compression) using the same pruning mask as saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10.
- NVFP4 weights source: NinjaBoffin/MiniMax-M2.7-NVFP4
- Pruning reference: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10
- Architecture: 62 transformer layers, 192 experts/layer, top-8 routing, hidden size 3072, 48 attention heads, 8 KV heads
- Total parameters: ~172 B (A10B, ~10 B activated per token)
## Quantization Details

| Component | Format | Notes |
|---|---|---|
| Expert FFN weights (w1/w2/w3) | NVFP4 | 4-bit float, group_size=16, per-group + global scales |
| Attention projections (q/k/v/o_proj) | bfloat16 | Excluded from quantization |
| Gate weights (block_sparse_moe.gate) | bfloat16 | Excluded from quantization |
| KV cache | FP8 | Per-layer k_scale/v_scale tensors |
Each expert stores three scale tensors per weight matrix: `input_scale` (per-tensor activation scale), `weight_scale` (per-group weight scale), and `weight_scale_2` (global weight scale). The FP8 KV cache uses `k_proj.k_scale` and `v_proj.v_scale`, which vLLM remaps to its internal attention scale slots.
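For intuition, here is a rough sketch of how the per-group and global weight scales combine during dequantization. This is an assumption for illustration only, not the vLLM NVFP4 kernel; it takes already-decoded FP4 values as float32 input.

```python
# Rough dequantization sketch (assumption, not the vLLM NVFP4 kernel):
# each group of 16 FP4 values shares one weight_scale entry, and the whole
# tensor shares a single global weight_scale_2.
import numpy as np

def dequantize_nvfp4(w_fp4: np.ndarray, weight_scale: np.ndarray,
                     weight_scale_2: float, group_size: int = 16) -> np.ndarray:
    """w_fp4: decoded FP4 values as float32, shape (rows, cols), cols % group_size == 0.
    weight_scale: per-group scales, shape (rows, cols // group_size)."""
    rows, cols = w_fp4.shape
    groups = w_fp4.reshape(rows, cols // group_size, group_size)
    scales = weight_scale[:, :, None] * weight_scale_2  # group scale * global scale
    return (groups * scales).reshape(rows, cols)
```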
## How This Was Made

### Pruning Mask Extraction
The REAP expert pruning removes 64 out of 256 experts per layer (different experts per layer — not a uniform pattern). We extracted the pruning mask by comparing router matrices between the original 256-expert NVFP4 model and the already-pruned 192-expert reference model:
1. Router comparison: For each of the 62 layers, the 256-row original gate weight matrix is matched against the 192-row pruned gate weight matrix using the Hungarian algorithm on cosine distance.
2. Mask generation: Experts in the original that don't match any row in the pruned model are marked as deleted (64 per layer). The per-layer mask is saved to `extras/deleted_experts.json`.
3. Expert deletion: The identified experts are removed from all NVFP4 weight and scale tensors, gate weights are row-sliced, and remaining experts are renumbered 0–191. Asymmetric KV-cache zero-point tensors (`k_bias`/`v_bias`) are stripped since vLLM does not use them; the symmetric `k_scale`/`v_scale` tensors are preserved.
The full script lives at `extras/delete_experts.py` and can also apply a saved mask to any compatible model; a rough sketch of the router-comparison step follows below.
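This is a minimal sketch of the per-layer matching, assuming the gate weights are available as NumPy arrays; the actual logic lives in `extras/delete_experts.py` and may differ in detail.

```python
# Sketch of the per-layer Hungarian matching on cosine distance
# (illustrative only, not the exact code in extras/delete_experts.py).
import numpy as np
from scipy.optimize import linear_sum_assignment

def deleted_experts_for_layer(orig_gate: np.ndarray, pruned_gate: np.ndarray) -> list[int]:
    """orig_gate: (256, hidden) router rows; pruned_gate: (192, hidden) router rows."""
    a = orig_gate / np.linalg.norm(orig_gate, axis=1, keepdims=True)
    b = pruned_gate / np.linalg.norm(pruned_gate, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine distance, shape (256, 192)
    rows, _ = linear_sum_assignment(cost)     # Hungarian: each pruned row matched to one original
    kept = set(rows.tolist())                 # original experts that survive
    return [i for i in range(orig_gate.shape[0]) if i not in kept]  # 64 deleted experts
```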
## Usage with vLLM

This model requires a vLLM build with NVFP4 and FP8 KV-cache support (Blackwell / GB10 or later) and the `minimax_m2` model backend.
```bash
vllm serve /path/to/MiniMax-M2.7-REAP-172B-A10B-NVFP4 \
  --served-model-name Nikola \
  --port 9000 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_optthink \
  --reasoning-parser-plugin extras/minimax_m2_optthink_reasoning_parser.py \
  --chat-template extras/chat_template.jinja \
  --enable-prefix-caching \
  --attention-backend FLASHINFER \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4
```
See `extras/inference_minimax.sh` for the full launch script used on DGX Spark / GB10.
## extras/ Directory

### `inference_minimax.sh`
vLLM server launcher for DGX Spark / GB10. It runs inside a custom `unglitched_vllm` container via dockless. By default it loads the parent model directory (where this script lives); pass a path to override:
```bash
./extras/inference_minimax.sh                  # serves ../MiniMax-M2.7-REAP-172B-A10B-NVFP4
./extras/inference_minimax.sh /path/to/model   # custom path
```
Notable flags used:
- `VLLM_USE_FLASHINFER_MOE_FP4=0` — disables the FlashInfer MoE FP4 kernel (uses the fallback NVFP4 GEMM path, which is more stable on GB10)
- `--async-scheduling --enable-chunked-prefill` — latency-oriented scheduling
- `--cudagraph-capture-sizes 1 2 4` — captures CUDA graphs for typical batch sizes
### `chat_template.jinja`

Custom Jinja2 chat template that supports optional chain-of-thought reasoning. Pass `enable_thinking=False` in `chat_template_kwargs` to suppress `<think>` blocks:
```python
# OpenAI client — disable thinking
from openai import OpenAI

# Points at the vLLM server started above; vLLM ignores the API key locally.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

client.chat.completions.create(
    model="Nikola",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
When `enable_thinking=True` (the default), the model produces a `<think>…</think>` block before its answer, which the reasoning parser extracts into the `reasoning_content` field of the response.
### `minimax_m2_optthink_reasoning_parser.py`
vLLM reasoning parser registered as `minimax_m2_optthink`. It handles MiniMax M2's convention where the model emits only a closing `</think>` token (no opening tag): all content before `</think>` is treated as reasoning and placed in `reasoning_content`; everything after is the assistant reply. When `enable_thinking=False` was passed in `chat_template_kwargs`, the parser skips extraction entirely, so no thinking tokens appear in the output.
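To make the convention concrete, here is a toy illustration of the split; `split_reasoning` is a hypothetical helper, not the shipped parser, which additionally handles streaming and vLLM's plugin interface.

```python
# Toy illustration of the closing-tag-only convention (hypothetical helper,
# not the code in extras/minimax_m2_optthink_reasoning_parser.py).
def split_reasoning(text: str, enable_thinking: bool = True) -> tuple[str | None, str]:
    """Split a completed generation into (reasoning_content, content)."""
    if not enable_thinking or "</think>" not in text:
        return None, text  # no extraction when thinking is disabled or the tag is absent
    reasoning, _, answer = text.partition("</think>")
    return reasoning.strip(), answer.strip()

print(split_reasoning("Let me check the path first.</think>The file is at src/main.rs"))
```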
Load it at runtime via `--reasoning-parser-plugin`; no installation needed:

```bash
--reasoning-parser minimax_m2_optthink \
--reasoning-parser-plugin /path/to/extras/minimax_m2_optthink_reasoning_parser.py
```
### `deleted_experts.json`

Pre-computed pruning mask: a JSON mapping from each of the 62 layer indices to the list of 64 expert indices (0–255) deleted from that layer. It can be passed directly to `delete_experts.py` via `--deleted-experts-file` to skip the Hungarian-matching step.
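A quick sanity check of the mask shape, assuming the JSON maps layer index to a list of deleted expert indices (the exact key type is an assumption):

```python
import json

# Sanity-check the pruning mask: 62 layers, 64 deleted experts each, indices in 0-255.
with open("extras/deleted_experts.json") as f:
    mask = json.load(f)

assert len(mask) == 62
for layer, deleted in mask.items():
    assert len(deleted) == 64
    assert all(0 <= e < 256 for e in deleted)
print("mask OK")
```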
### `delete_experts.py`

Utility script to apply expert deletion to a compatible model. It has two modes.

Find the pruning mask by comparison, then apply it:
```bash
python extras/delete_experts.py \
  /path/to/MiniMax-M2.7-NVFP4 \
  /path/to/output \
  --num-original-experts 256 \
  --num-retained-experts 192 \
  --compare-with /path/to/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 \
  --save-deleted-experts extras/deleted_experts.json
```
Apply a saved mask directly:
```bash
python extras/delete_experts.py \
  /path/to/MiniMax-M2.7-NVFP4 \
  /path/to/output \
  --num-original-experts 256 \
  --num-retained-experts 192 \
  --deleted-experts-file extras/deleted_experts.json
```
### `unglitched_vllm`

Custom vLLM build script and GPU memory swap utility for DGX Spark. `force_swap.cpp` implements a small helper that forces GPU pages to swap to system memory, useful for fitting the model on a single GB10 node.
## License
This model inherits the MiniMax M2.7 Non-Commercial License. See LICENSE for full terms.
## Citation
```bibtex
@misc{minimax-m2-7,
  title={MiniMax-M2.7},
  author={MiniMax},
  url={https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}

@misc{minimax-m2-7-reap-nvfp4,
  title={MiniMax-M2.7-REAP-172B-A10B-NVFP4},
  author={Oleg K.},
  note={NVFP4 weights from NinjaBoffin/MiniMax-M2.7-NVFP4; pruning mask
        from saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10},
  year={2026}
}
```