GLM-4.5-Iceblink-106B-A12B (W8A8 FP8 with 2D-block quantization)
This repo contains GLM-4.5-Iceblink-106B-A12B quantized to mixed FP8/BF16 precision following state-of-the-art Mixture-of-Experts quantization practices.
- Original Model:
The model requires Ada (4000 series), Hopper (H100) or Blackwell (5000 series) GPUs for hardware FP8 support.
📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; the script below is suitable for such a configuration with a 131072-token context length.
Recommendations
It is however recommended to use only 65K context to avoid significant quality degradation (see https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
The recommended sampler is min-p sampling. It is available through both the older Text Completions API and the Chat Completions API (as well as the newer Responses API); however, most LLM frontends only allow modifying min-p when using Text Completions.
You can however use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override the sampler server-side (the effective configuration is a merge of `generation_config.json` and vLLM defaults).
Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-4.5-Iceblink-106B-A12B-FP8"
MODELNAME="GLM-4.5-Iceblink"
GPU_UTIL=0.75

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Note that top_p=0.95 seems to lead to a serious paragraph repetition issue
SAMPLER_OVERRIDE='{"temperature": 0.8, "min_p": 0.05, "top_p": 1}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
# export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --tensor-parallel-size 2 \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
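As a quick check that the server is up and that min-p is applied even when your frontend does not expose it, here is a minimal sketch using the `openai` Python client against vLLM's OpenAI-compatible Text Completions endpoint. The base URL, API key and prompt are placeholders, and passing `min_p` through `extra_body` is a vLLM-specific extension.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is not checked by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="GLM-4.5-Iceblink",       # must match --served-model-name
    prompt="Once upon a time",      # placeholder prompt
    max_tokens=128,
    temperature=0.8,
    extra_body={"min_p": 0.05},     # vLLM-specific sampling parameter
)
print(completion.choices[0].text)
```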
ℹ️ The FlashInfer backend may fail with an error similar to:

```
Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator
```

A workaround is to run a sed replacement within the vLLM install to increase the workspace buffer size:

```bash
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 768 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
```

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 or https://github.com/vllm-project/vllm/pull/28269.
🔬 Quantization method
The llmcompressor library was used with the following recipe:
```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

# FP8 weights: static, symmetric, 2D 32x32 block scales
# FP8 activations: dynamic, symmetric, per-group (128) scales
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        dynamic=False,
        symmetric=True,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[32, 32],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
)

# Layers kept in BF16
ignore = [
    "lm_head",
    "model.embed_tokens",
    "model.norm",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate$",       # MoE router
    # Keep first block (GLM-4.5-Air first_k_dense_replace = 1), also weird loading here:
    # https://github.com/vllm-project/vllm/blob/v0.11.0/vllm/model_executor/models/glm4_moe.py#L525-L547
    "re:model.layers.0.*",
    # MTP layer (Multi-token prediction, cannot be loaded by huggingface/transformers)
    "re:model.layers.46.*",
]
```
FP8 quantization does not require calibration: weights are quantized statically and activation scales are computed dynamically at inference time.
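Since no calibration data is needed, the end-to-end compression flow is short. Below is a minimal sketch of how the recipe above could be wired into llmcompressor's `oneshot` entry point, following the library's published data-free FP8 examples; the exact import paths, the use of `config_groups`, and the save arguments may differ between llmcompressor versions, and the source model path is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/GLM-4.5-Iceblink-106B-A12B"   # placeholder: the unquantized source model
SAVE_DIR = "GLM-4.5-Iceblink-106B-A12B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Wrap the scheme and BF16 ignore list from the recipe above in a QuantizationModifier.
recipe = QuantizationModifier(
    config_groups={"group_0": scheme},  # `scheme` as defined in the recipe above
    ignore=ignore,                      # `ignore` as defined in the recipe above
)

# Data-free one-shot pass: no calibration dataset is passed since activations
# are quantized dynamically at runtime.
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```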
Deep-dive
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
Note: expert layers might not be stored as Linear layers, meaning they might be skipped when using llmcompressor with a Linear target.
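One way to verify which FFN/expert modules a `targets=["Linear"]` scheme would actually reach is to inspect the module classes of the instantiated model. A minimal sketch using plain PyTorch and transformers, assuming a transformers version with native Glm4Moe support; whether anything is flagged depends on how the architecture stores its experts, so treat this as a diagnostic, not a guarantee:

```python
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModelForCausalLM

# Build the module tree on the meta device: structure only,
# no weights are downloaded or allocated (only config.json is fetched).
config = AutoConfig.from_pretrained("zai-org/GLM-4.5-Air")
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

non_linear = []
for name, module in model.named_modules():
    is_leaf = next(module.children(), None) is None
    has_params = any(True for _ in module.parameters(recurse=False))
    if is_leaf and has_params and "mlp" in name and not isinstance(module, nn.Linear):
        non_linear.append((name, type(module).__name__))

# Weight-holding modules listed here would NOT be matched by targets=["Linear"]
# and would therefore stay in their original precision.
print(f"{len(non_linear)} non-Linear weight-holding MLP/expert leaf modules")
print(non_linear[:10])
```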
Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in a large gain compared to spending them on FFN layers. According to [3], for 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact
Hence, to preserve model quality, we choose not to quantize dense FFN layers (i.e. shared experts) and self-attention layers.
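The ignore list in the recipe encodes exactly this choice. The sketch below checks a few hypothetical module names (following GLM-4.5-Air naming) against a subset of the ignore patterns; `re.match` is used here as an approximation of how llmcompressor/compressed_tensors interpret the `re:` prefix, which is an assumption about their matching semantics.

```python
import re

# A subset of the ignore patterns from the recipe above ("re:" marks a regex).
ignore = [
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate$",
    "re:model.layers.0.*",
]

# Hypothetical module names following GLM-4.5-Air naming.
candidates = [
    "model.layers.0.mlp.down_proj",              # first (dense) block  -> kept in BF16
    "model.layers.5.self_attn.q_proj",           # self-attention       -> kept in BF16
    "model.layers.5.mlp.shared_experts.up_proj", # shared expert        -> kept in BF16
    "model.layers.5.mlp.gate",                   # MoE router           -> kept in BF16
    "model.layers.5.mlp.experts.3.up_proj",      # routed expert        -> quantized to FP8
]

def is_ignored(name: str) -> bool:
    for pattern in ignore:
        regex = pattern.removeprefix("re:")
        if re.match(regex, name):  # assumption: approximates compressed_tensors matching
            return True
    return False

for name in candidates:
    print(f"{name}: {'BF16 (ignored)' if is_ignored(name) else 'FP8 (quantized)'}")
```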
We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16
According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving them to the last k blocks.
In this case, we keep the first block unquantized, as "first_k_dense_replace": 1 in config.json makes it the model's only dense FFN block.
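If one wanted to keep the first k blocks unquantized more generally (a hypothetical extension, not part of this repo's recipe), the corresponding ignore patterns could be generated as follows; note that escaping the dots avoids, for example, the pattern for layer 1 also matching layers 10-19.

```python
# Hypothetical helper: ignore patterns for the first k transformer blocks.
def first_k_block_ignores(k: int) -> list[str]:
    return [f"re:model\\.layers\\.{i}\\..*" for i in range(k)]

print(first_k_block_ignores(2))
# ['re:model\\.layers\\.0\\..*', 're:model\\.layers\\.1\\..*']
```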
References
[1] Why Do Some Inputs Break Low-Bit LLM Quantization? (2025). Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia. https://arxiv.org/pdf/2506.12044

[2] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024). Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. https://arxiv.org/pdf/2406.08155v1

[3] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410