Note: You must run with --disable-shared-experts-fusion in sglang, otherwise it will incorrectly attempt to fuse the BF16 shared expert.

Update 5/6/26 - sglang container updated with SM120/RTX 6000 Native Sparse Attention support

Model Description

GLM-5.1-NVFP4 is an NVFP4-quantized version of zai-org/GLM-5.1, a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).

Quantized directly from the full BF16 checkpoint (zai-org/GLM-5.1), not the FP8 release, to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer.

What's quantized

Only the non-shared MoE expert MLP projections are quantized to NVFP4. Attention weights are left in BF16, in addition to the dense MLPs (layers 0-3) and the shared experts. Since the MoE expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than typical to ensure broad expert coverage through natural routing alone.

Calibration dataset

Three calibration passes were run:

  1. Coding pass — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
  2. Broad pass — Large-scale diverse samples drawn from WildChat-NonToxic and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
  3. Deep pass — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.

Requirements

Hardware: 8x RTX PRO 6000 Blackwell 96GB (b12x MoE runner recommended)

Community Testing

vLLM

  SPEC_CONFIG='{"model":"lukealonso/GLM-5.1-NVFP4-MTP","method":"mtp","num_speculative_tokens":3,"rejection_sample_method":"probabilistic","moe_backend":"flashinfer_cutlass","use_local_argmax_reduction":true}'
  HF_OVERRIDES='{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'

  docker run -d --gpus all --ipc=host --network host --privileged \
    --name vllm-glm51-ficutlass \
    --entrypoint /bin/bash \
    -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    -e OMP_NUM_THREADS=16 \
    -e CUTE_DSL_ARCH=sm_120a \
    -e CUDA_DEVICE_MAX_CONNECTIONS=32 \
    -e NCCL_P2P_LEVEL=SYS \
    -e NCCL_GRAPH_FILE=/opt/vllm/nccl_graph_opt.xml \
    -e VLLM_ENABLE_PCIE_ALLREDUCE=1 \
    -e VLLM_USE_B12X_SPARSE_INDEXER=1 \
    -e VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 \
    -e VLLM_DISABLED_KERNELS=MarlinFP8ScaledMMLinearKernel \
    -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
    -e VLLM_LOG_STATS_INTERVAL=1 \
    -e VLLM_MTP_RETURN_NORMALIZED_HIDDEN=1 \
    -e VLLM_B12X_MLA_SPEC_SERIAL_DECODE=0 \
    -e XDG_CACHE_HOME=/cache/jit \
    -e CUDA_CACHE_PATH=/cache/jit \
    -e VLLM_CACHE_DIR=/cache/jit/vllm \
    -e TVM_FFI_CACHE_DIR=/cache/jit/tvm-ffi \
    -e FLASHINFER_WORKSPACE_BASE=/cache/jit/flashinfer \
    -e VLLM_CACHE_ROOT=/root/.cache/vllm \
    -e TRITON_CACHE_DIR=/root/.cache/triton \
    -e TORCHINDUCTOR_CACHE_DIR=/root/.cache/torchinductor \
    -e TORCH_EXTENSIONS_DIR=/cache/jit/torch_extensions \
    -e CUTE_DSL_CACHE_DIR=/root/.cache/cutlass_dsl \
    -e SPEC_CONFIG="$SPEC_CONFIG" \
    -e HF_OVERRIDES="$HF_OVERRIDES" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /mnt/nccl_graph_opt.xml:/opt/vllm/nccl_graph_opt.xml:ro \
    -v ~/.cache/vllm-glm51/cutlass_dsl:/root/.cache/cutlass_dsl \
    -v ~/.cache/vllm-glm51/jit:/cache/jit \
    -v ~/.cache/vllm-glm51/vllm:/root/.cache/vllm \
    -v ~/.cache/vllm-glm51/triton:/root/.cache/triton \
    -v ~/.cache/vllm-glm51/torchinductor:/root/.cache/torchinductor \
    voipmonitor/vllm:glm51-mtp-b12xsparse-ficutlass-topkfix-b12x0111-cg256-20260504 \
    -lc 'exec /opt/venv/bin/vllm serve lukealonso/GLM-5.1-NVFP4-MTP \
      --served-model-name GLM-5 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 5288 \
      --tensor-parallel-size 8 \
      --pipeline-parallel-size 1 \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --load-format fastsafetensors \
      --async-scheduling \
      --gpu-memory-utilization 0.865 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 64 \
      --mm-processor-cache-gb 0 \
      --mm-encoder-tp-mode weights \
      --attention-backend B12X_MLA_SPARSE \
      --moe-backend flashinfer_cutlass \
      --kv-cache-dtype fp8 \
      --tool-call-parser glm47 \
      --enable-auto-tool-choice \
      --reasoning-parser glm45 \
      --speculative-config "$SPEC_CONFIG" \
      --hf-overrides "$HF_OVERRIDES" \
      --max-cudagraph-capture-size 256'

sglang

  Docker Image: voipmonitor/sglang:glm51-nsa-luke-a2573ab-b12x0110-20260430        
  Model: lukealonso/GLM-5.1-NVFP4
          
  Launch command:                 
  export OMP_NUM_THREADS=16       
  export SGLANG_ENABLE_SPEC_V2=True                       
  export NVIDIA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8  # 8x Blackwell                   
          
  python -m sglang.launch_server \
    --model-path /path/to/lukealonso/GLM-5.1-NVFP4 \        
    --served-model-name GLM-5 \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --disable-piecewise-cuda-graph \
    --kv-cache-dtype fp8_e4m3 \
    --trust-remote-code \
    --enable-pcie-oneshot-allreduce \
    --disable-shared-experts-fusion \
    --page-size 64 \
    --nsa-prefill-backend b12x \
    --nsa-decode-backend b12x \
    --attention-backend nsa \
    --moe-runner-backend b12x \
    --fp4-gemm-backend b12x \
    --cuda-graph-max-bs 8 \
    --chunked-prefill-size 4096 \
    --preferred-sampling-params '{"temperature": 1.0, "top_p": 0.95}' \
    --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
    --max-running-requests 8 \
    --mem-fraction-static 0.825 \
    --host 0.0.0.0 --port 8000
Downloads last month
37,320
Safetensors
Model size
437B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 1 Ask for provider support

Model tree for lukealonso/GLM-5.1-NVFP4

Base model

zai-org/GLM-5.1
Quantized
(40)
this model
Quantizations
5 models