GLM-4.5-Iceblink-106B-A12B (W8A8 FP8 with 2D-block quantization)
This repo contains GLM-4.5-Iceblink-106B-A12B quantized to mixed FP8/BF16 precision following state-of-the-art Mixture-of-Experts quantization practices.
- Original Model:
The model requires Ada (4000 series), Hopper (H100) or Blackwell (5000 series) GPUs for hardware FP8 support.
📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; the script below is suitable for such a configuration with a 131072-token context length.
Recommendations
It is however recommended to use only 65K context to avoid significant quality degradation (see https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
The recommended sampler is min-p sampling. It is available through both the older Text Completions API and the Chat Completions API (as well as the newer Responses API); however, most LLM frontends only allow modifying min-p when using Text Completions.
You can however use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override the sampler server-side (the effective configuration is a merge of `generation_config.json` and vLLM defaults).
Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-4.5-Iceblink-106B-A12B-FP8"
MODELNAME="GLM-4.5-Iceblink"
GPU_UTIL=0.75

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Note that top_p=0.95 seems to lead to a serious paragraph repetition issue
SAMPLER_OVERRIDE='{"temperature": 0.8, "min_p": 0.05, "top_p": 1}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
# export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --tensor-parallel-size 2 \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
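As a quick check that the server is up and that min-p is applied even when your frontend does not expose it, here is a minimal sketch using the `openai` Python client against vLLM's OpenAI-compatible Text Completions endpoint. The base URL, API key and prompt are placeholders, and passing `min_p` through `extra_body` is a vLLM-specific extension.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is not checked by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="GLM-4.5-Iceblink",       # must match --served-model-name
    prompt="Once upon a time",      # placeholder prompt
    max_tokens=128,
    temperature=0.8,
    extra_body={"min_p": 0.05},     # vLLM-specific sampling parameter
)
print(completion.choices[0].text)
```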
ℹ️ The FlashInfer backend may fail with an error similar to:

```
Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator
```

A workaround is to run a sed replacement within the vLLM install to increase the workspace buffer size:

```bash
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 768 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
```

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 or https://github.com/vllm-project/vllm/pull/28269.
🔬 Quantization method
The llmcompressor library was used with the following recipe:
```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

# FP8 weights: static, symmetric, 2D 32x32 block scales
# FP8 activations: dynamic, symmetric, per-group (128) scales
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        dynamic=False,
        symmetric=True,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[32, 32],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
)

# Layers kept in BF16
ignore = [
    "lm_head",
    "model.embed_tokens",
    "model.norm",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate$",       # MoE router
    # Keep first block (GLM-4.5-Air first_k_dense_replace = 1), also weird loading here:
    # https://github.com/vllm-project/vllm/blob/v0.11.0/vllm/model_executor/models/glm4_moe.py#L525-L547
    "re:model.layers.0.*",
    # MTP layer (Multi-token prediction, cannot be loaded by huggingface/transformers)
    "re:model.layers.46.*",
]
```
FP8 quantization does not require calibration: weights are quantized statically and activation scales are computed dynamically at inference time.
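Since no calibration data is needed, the end-to-end compression flow is short. Below is a minimal sketch of how the recipe above could be wired into llmcompressor's `oneshot` entry point, following the library's published data-free FP8 examples; the exact import paths, the use of `config_groups`, and the save arguments may differ between llmcompressor versions, and the source model path is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/GLM-4.5-Iceblink-106B-A12B"   # placeholder: the unquantized source model
SAVE_DIR = "GLM-4.5-Iceblink-106B-A12B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Wrap the scheme and BF16 ignore list from the recipe above in a QuantizationModifier.
recipe = QuantizationModifier(
    config_groups={"group_0": scheme},  # `scheme` as defined in the recipe above
    ignore=ignore,                      # `ignore` as defined in the recipe above
)

# Data-free one-shot pass: no calibration dataset is passed since activations
# are quantized dynamically at runtime.
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```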
Deep-dive
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
Note: expert layers might not be stored as Linear layers, meaning they might be skipped when using llmcompressor with a Linear target.
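One way to verify which FFN/expert modules a `targets=["Linear"]` scheme would actually reach is to inspect the module classes of the instantiated model. A minimal sketch using plain PyTorch and transformers, assuming a transformers version with native Glm4Moe support; whether anything is flagged depends on how the architecture stores its experts, so treat this as a diagnostic, not a guarantee:

```python
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModelForCausalLM

# Build the module tree on the meta device: structure only,
# no weights are downloaded or allocated (only config.json is fetched).
config = AutoConfig.from_pretrained("zai-org/GLM-4.5-Air")
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

non_linear = []
for name, module in model.named_modules():
    is_leaf = next(module.children(), None) is None
    has_params = any(True for _ in module.parameters(recurse=False))
    if is_leaf and has_params and "mlp" in name and not isinstance(module, nn.Linear):
        non_linear.append((name, type(module).__name__))

# Weight-holding modules listed here would NOT be matched by targets=["Linear"]
# and would therefore stay in their original precision.
print(f"{len(non_linear)} non-Linear weight-holding MLP/expert leaf modules")
print(non_linear[:10])
```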
Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in a large gain compared to spending them on FFN layers. According to [3], for 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact
Hence, to preserve model quality, we choose not to quantize dense FFN layers (i.e. shared experts) and self-attention layers.
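The ignore list in the recipe encodes exactly this choice. The sketch below checks a few hypothetical module names (following GLM-4.5-Air naming) against a subset of the ignore patterns; `re.match` is used here as an approximation of how llmcompressor/compressed_tensors interpret the `re:` prefix, which is an assumption about their matching semantics.

```python
import re

# A subset of the ignore patterns from the recipe above ("re:" marks a regex).
ignore = [
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate$",
    "re:model.layers.0.*",
]

# Hypothetical module names following GLM-4.5-Air naming.
candidates = [
    "model.layers.0.mlp.down_proj",              # first (dense) block  -> kept in BF16
    "model.layers.5.self_attn.q_proj",           # self-attention       -> kept in BF16
    "model.layers.5.mlp.shared_experts.up_proj", # shared expert        -> kept in BF16
    "model.layers.5.mlp.gate",                   # MoE router           -> kept in BF16
    "model.layers.5.mlp.experts.3.up_proj",      # routed expert        -> quantized to FP8
]

def is_ignored(name: str) -> bool:
    for pattern in ignore:
        regex = pattern.removeprefix("re:")
        if re.match(regex, name):  # assumption: approximates compressed_tensors matching
            return True
    return False

for name in candidates:
    print(f"{name}: {'BF16 (ignored)' if is_ignored(name) else 'FP8 (quantized)'}")
```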
We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16
According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving them to the last k blocks.
In this case, we keep the first block unquantized, as "first_k_dense_replace": 1 in config.json makes it the model's only dense FFN block.
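If one wanted to keep the first k blocks unquantized more generally (a hypothetical extension, not part of this repo's recipe), the corresponding ignore patterns could be generated as follows; note that escaping the dots avoids, for example, the pattern for layer 1 also matching layers 10-19.

```python
# Hypothetical helper: ignore patterns for the first k transformer blocks.
def first_k_block_ignores(k: int) -> list[str]:
    return [f"re:model\\.layers\\.{i}\\..*" for i in range(k)]

print(first_k_block_ignores(2))
# ['re:model\\.layers\\.0\\..*', 're:model\\.layers\\.1\\..*']
```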
References
[1] Why Do Some Inputs Break Low-Bit LLM Quantization? (2025). Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia. https://arxiv.org/pdf/2506.12044

[2] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024). Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. https://arxiv.org/pdf/2406.08155v1

[3] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410