---
license: mit
base_model:
- zerofata/GLM-4.5-Iceblink-v2-106B-A12B
datasets:
- zerofata/Instruct-Anime
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime-CreativeWriting
- zerofata/Summaries-Anime-FandomPages
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- conversational
- vllm
---

# GLM-4.5-Iceblink-v2-106B-A12B (W8A8 FP8 with 2D-block quantization)

This repo contains GLM-4.5-Iceblink-v2-106B-A12B quantized to mixed FP8/BF16 precision following state-of-the-art Mixture-of-Experts quantization practices.

- Original Model:
  - [zerofata/GLM-4.5-Iceblink-v2-106B-A12B](https://huggingface.co/zerofata/GLM-4.5-Iceblink-v2-106B-A12B)

The model requires Ada (RTX 4000 series), Hopper (H100) or Blackwell (RTX 5000 series) GPUs for hardware FP8 support.

## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; the script below is suitable for such a configuration with a 131072-token context length.

### Recommendations

It is however recommended to use only 65K of context to avoid significant quality degradation at longer contexts (see https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

The recommended sampler is min-p sampling. It is available through both the older Text Completions API and the Chat Completions API (as well as the newer Responses API); however, most LLM frontends only support modifying min-p when using Text Completions. You can instead use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler defaults server-side (the defaults being a merge of `generation_config.json` and vLLM defaults), as done in the running script below.
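If your frontend cannot set min-p, it can also be passed per request through vLLM's OpenAI-compatible server as an extra sampling parameter. Below is a minimal sketch using the `openai` Python client; it assumes the server from the running script below is listening on the default port 8000 and serving the model under the name `GLM-4.5-Iceblink-v2`.

```python
# Minimal sketch: per-request min-p sampling against vLLM's OpenAI-compatible server.
# Assumes `pip install openai` and that `vllm serve` (see the script below) is running
# on localhost:8000; adjust base_url and the served model name as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Chat Completions: min_p is not part of the official OpenAI schema,
# so it is passed through `extra_body` (vLLM accepts it as an extra sampling parameter).
chat = client.chat.completions.create(
    model="GLM-4.5-Iceblink-v2",
    messages=[{"role": "user", "content": "Write a short scene set in a snowstorm."}],
    max_tokens=512,
    temperature=0.8,
    extra_body={"min_p": 0.05},
)
print(chat.choices[0].message.content)

# Text Completions: same idea with the older /v1/completions endpoint.
completion = client.completions.create(
    model="GLM-4.5-Iceblink-v2",
    prompt="Once upon a time",
    max_tokens=128,
    temperature=0.8,
    extra_body={"min_p": 0.05},
)
print(completion.choices[0].text)
```

With `--override-generation-config` in the script below, these values are already the server-side defaults, so per-request overrides are only needed when experimenting.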
### Running script

```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-4.5-Iceblink-v2-106B-A12B-FP8"
MODELNAME="GLM-4.5-Iceblink-v2"
GPU_UTIL=0.75

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Note that top_p=0.95 seems to lead to a serious paragraph repetition issue
SAMPLER_OVERRIDE='{"temperature": 0.8, "min_p": 0.05, "top_p": 1}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
# export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --tensor-parallel-size 2 \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is running a sed replacement command within the vLLM install to increase the buffer size:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 768 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 or https://github.com/vllm-project/vllm/pull/28269.

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```python
scheme=QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        dynamic=False,
        symmetric=True,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[32, 32],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
),
ignore=[
    "lm_head",
    "model.embed_tokens",
    "model.norm",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate$",    # MoE router
    "re:model.layers.0.*", # Keep the first block unquantized (GLM-4.5-Air first_k_dense_replace = 1); also weird loading here: https://github.com/vllm-project/vllm/blob/v0.11.0/vllm/model_executor/models/glm4_moe.py#L525-L547
    "re:model.layers.46.*" # MTP layer (Multi-Token Prediction, cannot be loaded by huggingface/transformers)
],
```

FP8 quantization does not require calibration data.

### Deep-dive

Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul + Bias).
In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:

> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.

_Note: expert layers might not be stored as `Linear` layers, meaning they might be skipped when using `llmcompressor` with a `Linear` target._

Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in larger gains than spending them on FFN layers.

According to [3], with 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact

Hence, to preserve model quality, we choose not to quantize dense FFN layers (i.e. shared experts) and self-attention layers. We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
  - https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
  - https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json

According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks. In this case, we keep the first layer unquantized, consistent with `"first_k_dense_replace": 1` in [config.json](config.json).
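Putting the recipe and the layer-selection choices above together, a data-free quantization run can be driven by llmcompressor's `oneshot` entrypoint. The sketch below is illustrative, not the exact script used to produce this repo: the import path for `oneshot`, the `QuantizationModifier` arguments, and the need for extra loading options may vary between llmcompressor versions, and the abbreviated `ignore` list must be replaced with the full list from the recipe above.

```python
# Illustrative sketch: block-FP8 weights + dynamic per-group FP8 activations,
# applied data-free with llmcompressor (no calibration dataset needed).
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zerofata/GLM-4.5-Iceblink-v2-106B-A12B"
SAVE_DIR = "GLM-4.5-Iceblink-v2-106B-A12B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same scheme as the recipe above: FP8 weights quantized in 32x32 blocks,
# FP8 activations quantized dynamically per group of 128.
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        dynamic=False,
        symmetric=True,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[32, 32],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
)

recipe = QuantizationModifier(
    config_groups={"group_0": scheme},
    # Abbreviated for readability: use the full ignore list from the recipe above.
    ignore=["lm_head", "model.embed_tokens", "re:.*self_attn.*", "re:.*shared_experts.*"],
)

# No dataset argument: weight scales and dynamic activation scales are data-free.
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format, loadable by vLLM.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Using `config_groups` with an explicit `QuantizationScheme` mirrors the recipe above; llmcompressor also ships preset scheme strings, but the explicit form keeps the block and group sizes visible.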
### References

1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
   Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
   https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
   Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
   https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
   Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
   https://arxiv.org/pdf/2310.02410