lukealonso
/

MiniMax-M2.7-NVFP4

@@ -26,13 +26,90 @@ Calibration uses natural top-k routing rather than forcing all experts to activa
 Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules.
-### Quality
-(pending)
-You should always evaluate against your specific use case.
-#### SGLang
 Tested on 2x and 4x RTX Pro 6000 Blackwell.
 ```
@@ -62,8 +139,4 @@ Tested on 2x and 4x RTX Pro 6000 Blackwell.
     --host 0.0.0.0 --port 5000
 ```
-#### vLLM
-(pending)
 ```

 Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules.
+### Running
+```
+exec docker run \
+    --name sglang-m27a \
+    --ipc=host \
+    --shm-size=12g \
+    --network=host \
+    --cpuset-cpus=0-31 \
+    --ulimit memlock=-1 \
+    --ulimit stack=67108864 \
+    --ulimit nofile=1048576:1048576 \
+    --restart unless-stopped \
+    -e SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=1 \
+    -e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
+    -e SGLANG_CUSTOM_ALLREDUCE_ALGO=oneshot \
+    -e SGLANG_DISABLE_FA4_WARMUP=1 \
+    -e SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD=4096 \
+    -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
+    -e NCCL_IB_DISABLE=1 \
+    -e NCCL_P2P_DISABLE=1 \
+    -e NCCL_NVLS_ENABLE=0 \
+    -e NCCL_CUMEM_ENABLE=0 \
+    -e NCCL_P2P_LEVEL=SYS \
+    -e B12X_MOE_FORCE_A16=1 \
+    -e NCCL_ALLOC_P2P_NET_LL_BUFFERS=1 \
+    -e NCCL_MIN_NCHANNELS=8 \
+    -e NCCL_SOCKET_NTHREADS=4 \
+    -e NCCL_NSOCKS_PERTHREAD=2 \
+    -e NCCL_BUFFSIZE=16777216 \
+    -e TORCH_NCCL_AVOID_RECORD_STREAMS=1 \
+    -e OMP_NUM_THREADS=16 \
+    -e MKL_NUM_THREADS=16 \
+    -e OPENBLAS_NUM_THREADS=16 \
+    -e NUMEXPR_NUM_THREADS=16 \
+    -e TOKENIZERS_PARALLELISM=false \
+    -e CUDA_DEVICE_MAX_CONNECTIONS=1 \
+    -e CUDA_MODULE_LOADING=LAZY \
+    -e SAFETENSORS_FAST_GPU=1 \
+    -e TRITON_CACHE_DIR=/cache/triton \
+    -e TORCH_COMPILE_DEBUG=0 \
+    -e HF_HUB_ENABLE_HF_TRANSFER=1 \
+    -e HF_HOME=/root/.cache/huggingface \
+    -e TRANSFORMERS_OFFLINE=1 \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+       llm-sglang-blackwell:cu130 \
+    python -m sglang.launch_server \
+    --grammar-backend none \
+    --model lukealonso/MiniMax-M2.7-NVFP4 \
+    --served-model-name MiniMax-M2.7 \
+    --tensor-parallel-size 2 \
+    --quantization modelopt_fp4 \
+    --kv-cache-dtype bfloat16 \
+    --dtype auto \
+    --prefill-max-requests 4 \
+    --stream-interval 16 \
+    --load-format safetensors \
+    --trust-remote-code \
+    --context-length 196608 \
+    --mem-fraction-static 0.93 \
+    --chunked-prefill-size 4096 \
+    --max-prefill-tokens 4096 \
+    --disable-radix-cache \
+    --schedule-conservativeness 0.40 \
+    --max-running-requests 8 \
+    --cuda-graph-max-bs 8 \
+    --sampling-backend flashinfer \
+    --cuda-graph-bs 1 2 4 6 8 \
+    --num-continuous-decode-steps 4 \
+    --enable-mixed-chunk \
+    --attention-backend flashinfer \
+    --moe-runner-backend b12x \
+    --fp4-gemm-backend b12x \
+    --enable-pcie-oneshot-allreduce \
+    --pcie-oneshot-allreduce-max-size 8388608 \
+    --tool-call-parser minimax-m2 \
+    --reasoning-parser minimax-append-think \
+    --host 127.0.0.1 \
+    --hicache-size 36 \
+    --hicache-io-backend kernel \
+    --hicache-mem-layout page_first \
+    --hicache-write-policy write_through_selectitive
+  ```
 Tested on 2x and 4x RTX Pro 6000 Blackwell.
 ```
     --host 0.0.0.0 --port 5000
 ```
 ```