Note: You must run with --disable-shared-experts-fusion in sglang, otherwise it will incorrectly attempt to fuse the BF16 shared expert.
Update 5/6/26 - sglang container updated with SM120/RTX 6000 Native Sparse Attention support
Model Description
GLM-5.1-NVFP4 is an NVFP4-quantized version of zai-org/GLM-5.1, a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).
Quantized directly from the full BF16 checkpoint (zai-org/GLM-5.1), not the FP8 release, to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer.
What's quantized
Only the non-shared MoE expert MLP projections are quantized to NVFP4. Attention weights are left in BF16, in addition to the dense MLPs (layers 0-3) and the shared experts. Since the MoE expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than typical to ensure broad expert coverage through natural routing alone.
Calibration dataset
Three calibration passes were run:
- Coding pass — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
- Broad pass — Large-scale diverse samples drawn from WildChat-NonToxic and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
- Deep pass — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.
Requirements
Hardware: 8x RTX PRO 6000 Blackwell 96GB (b12x MoE runner recommended)
Community Testing
vLLM
SPEC_CONFIG='{"model":"lukealonso/GLM-5.1-NVFP4-MTP","method":"mtp","num_speculative_tokens":3,"rejection_sample_method":"probabilistic","moe_backend":"flashinfer_cutlass","use_local_argmax_reduction":true}'
HF_OVERRIDES='{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'
docker run -d --gpus all --ipc=host --network host --privileged \
--name vllm-glm51-ficutlass \
--entrypoint /bin/bash \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e OMP_NUM_THREADS=16 \
-e CUTE_DSL_ARCH=sm_120a \
-e CUDA_DEVICE_MAX_CONNECTIONS=32 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_GRAPH_FILE=/opt/vllm/nccl_graph_opt.xml \
-e VLLM_ENABLE_PCIE_ALLREDUCE=1 \
-e VLLM_USE_B12X_SPARSE_INDEXER=1 \
-e VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 \
-e VLLM_DISABLED_KERNELS=MarlinFP8ScaledMMLinearKernel \
-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
-e VLLM_LOG_STATS_INTERVAL=1 \
-e VLLM_MTP_RETURN_NORMALIZED_HIDDEN=1 \
-e VLLM_B12X_MLA_SPEC_SERIAL_DECODE=0 \
-e XDG_CACHE_HOME=/cache/jit \
-e CUDA_CACHE_PATH=/cache/jit \
-e VLLM_CACHE_DIR=/cache/jit/vllm \
-e TVM_FFI_CACHE_DIR=/cache/jit/tvm-ffi \
-e FLASHINFER_WORKSPACE_BASE=/cache/jit/flashinfer \
-e VLLM_CACHE_ROOT=/root/.cache/vllm \
-e TRITON_CACHE_DIR=/root/.cache/triton \
-e TORCHINDUCTOR_CACHE_DIR=/root/.cache/torchinductor \
-e TORCH_EXTENSIONS_DIR=/cache/jit/torch_extensions \
-e CUTE_DSL_CACHE_DIR=/root/.cache/cutlass_dsl \
-e SPEC_CONFIG="$SPEC_CONFIG" \
-e HF_OVERRIDES="$HF_OVERRIDES" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /mnt/nccl_graph_opt.xml:/opt/vllm/nccl_graph_opt.xml:ro \
-v ~/.cache/vllm-glm51/cutlass_dsl:/root/.cache/cutlass_dsl \
-v ~/.cache/vllm-glm51/jit:/cache/jit \
-v ~/.cache/vllm-glm51/vllm:/root/.cache/vllm \
-v ~/.cache/vllm-glm51/triton:/root/.cache/triton \
-v ~/.cache/vllm-glm51/torchinductor:/root/.cache/torchinductor \
voipmonitor/vllm:glm51-mtp-b12xsparse-ficutlass-topkfix-b12x0111-cg256-20260504 \
-lc 'exec /opt/venv/bin/vllm serve lukealonso/GLM-5.1-NVFP4-MTP \
--served-model-name GLM-5 \
--trust-remote-code \
--host 0.0.0.0 \
--port 5288 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 1 \
--enable-chunked-prefill \
--enable-prefix-caching \
--load-format fastsafetensors \
--async-scheduling \
--gpu-memory-utilization 0.865 \
--max-num-batched-tokens 8192 \
--max-num-seqs 64 \
--mm-processor-cache-gb 0 \
--mm-encoder-tp-mode weights \
--attention-backend B12X_MLA_SPARSE \
--moe-backend flashinfer_cutlass \
--kv-cache-dtype fp8 \
--tool-call-parser glm47 \
--enable-auto-tool-choice \
--reasoning-parser glm45 \
--speculative-config "$SPEC_CONFIG" \
--hf-overrides "$HF_OVERRIDES" \
--max-cudagraph-capture-size 256'
sglang
Docker Image: voipmonitor/sglang:glm51-nsa-luke-a2573ab-b12x0110-20260430
Model: lukealonso/GLM-5.1-NVFP4
Launch command:
export OMP_NUM_THREADS=16
export SGLANG_ENABLE_SPEC_V2=True
export NVIDIA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8 # 8x Blackwell
python -m sglang.launch_server \
--model-path /path/to/lukealonso/GLM-5.1-NVFP4 \
--served-model-name GLM-5 \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--tensor-parallel-size 8 \
--quantization modelopt_fp4 \
--disable-piecewise-cuda-graph \
--kv-cache-dtype fp8_e4m3 \
--trust-remote-code \
--enable-pcie-oneshot-allreduce \
--disable-shared-experts-fusion \
--page-size 64 \
--nsa-prefill-backend b12x \
--nsa-decode-backend b12x \
--attention-backend nsa \
--moe-runner-backend b12x \
--fp4-gemm-backend b12x \
--cuda-graph-max-bs 8 \
--chunked-prefill-size 4096 \
--preferred-sampling-params '{"temperature": 1.0, "top_p": 0.95}' \
--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
--max-running-requests 8 \
--mem-fraction-static 0.825 \
--host 0.0.0.0 --port 8000
- Downloads last month
- 37,320