---
license: mit
base_model:
- zerofata/GLM-4.5-Iceblink-v2-106B-A12B
datasets:
- zerofata/Instruct-Anime
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime-CreativeWriting
- zerofata/Summaries-Anime-FandomPages
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- conversational
- vllm
---

# GLM-4.5-Iceblink-v2-106B-A12B (W8A8 FP8 with 2D-block quantization)

This repo contains GLM-4.5-Iceblink-v2-106B-A12B quantized to mixed FP8/BF16 precision following state-of-the-art Mixture-of-Experts quantization practices.

- Original Model:
  - [zerofata/GLM-4.5-Iceblink-v2-106B-A12B](https://huggingface.co/zerofata/GLM-4.5-Iceblink-v2-106B-A12B)

The model requires Ada (RTX 4000 series), Hopper (H100) or Blackwell (RTX 5000 series) GPUs for hardware FP8 support.

## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; the script below is suitable for such a configuration with a 131072-token context length.

### Recommendations

It is however recommended to use only 65K of context to avoid significant quality degradation at longer contexts (see https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

The recommended sampler is min-p sampling. It is available through both the older Text Completions API and the Chat Completions API (as well as the newer Responses API); however, most LLM frontends only support modifying min-p when using Text Completions. You can instead use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler defaults server-side (the defaults being a merge of `generation_config.json` and vLLM defaults), as done in the running script below.
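If your frontend cannot set min-p, it can also be passed per request through vLLM's OpenAI-compatible server as an extra sampling parameter. Below is a minimal sketch using the `openai` Python client; it assumes the server from the running script below is listening on the default port 8000 and serving the model under the name `GLM-4.5-Iceblink-v2`.

```python
# Minimal sketch: per-request min-p sampling against vLLM's OpenAI-compatible server.
# Assumes `pip install openai` and that `vllm serve` (see the script below) is running
# on localhost:8000; adjust base_url and the served model name as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Chat Completions: min_p is not part of the official OpenAI schema,
# so it is passed through `extra_body` (vLLM accepts it as an extra sampling parameter).
chat = client.chat.completions.create(
    model="GLM-4.5-Iceblink-v2",
    messages=[{"role": "user", "content": "Write a short scene set in a snowstorm."}],
    max_tokens=512,
    temperature=0.8,
    extra_body={"min_p": 0.05},
)
print(chat.choices[0].message.content)

# Text Completions: same idea with the older /v1/completions endpoint.
completion = client.completions.create(
    model="GLM-4.5-Iceblink-v2",
    prompt="Once upon a time",
    max_tokens=128,
    temperature=0.8,
    extra_body={"min_p": 0.05},
)
print(completion.choices[0].text)
```

With `--override-generation-config` in the script below, these values are already the server-side defaults, so per-request overrides are only needed when experimenting.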
### Running script

```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-4.5-Iceblink-v2-106B-A12B-FP8"
MODELNAME="GLM-4.5-Iceblink-v2"
GPU_UTIL=0.75

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Note that top_p=0.95 seems to lead to a serious paragraph repetition issue
SAMPLER_OVERRIDE='{"temperature": 0.8, "min_p": 0.05, "top_p": 1}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
# export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --tensor-parallel-size 2 \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is running a sed replacement command within the vLLM install to increase the buffer size:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 768 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 or https://github.com/vllm-project/vllm/pull/28269.

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```python
scheme=QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        dynamic=False,
        symmetric=True,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[32, 32],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
),
ignore=[
    "lm_head",
    "model.embed_tokens",
    "model.norm",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate$",    # MoE router
    "re:model.layers.0.*", # Keep the first block unquantized (GLM-4.5-Air first_k_dense_replace = 1); also weird loading here: https://github.com/vllm-project/vllm/blob/v0.11.0/vllm/model_executor/models/glm4_moe.py#L525-L547
    "re:model.layers.46.*" # MTP layer (Multi-Token Prediction, cannot be loaded by huggingface/transformers)
],
```

FP8 quantization does not require calibration data.

### Deep-dive

Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul + Bias).
In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:

> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.

_Note: expert layers might not be stored as `Linear` layers, meaning they might be skipped when using `llmcompressor` with a `Linear` target._

Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in larger gains than spending them on FFN layers.

According to [3], with 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact

Hence, to preserve model quality, we choose not to quantize dense FFN layers (i.e. shared experts) and self-attention layers. We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
  - https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
  - https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json

According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks. In this case, we keep the first layer unquantized, consistent with `"first_k_dense_replace": 1` in [config.json](config.json).
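Putting the recipe and the layer-selection choices above together, a data-free quantization run can be driven by llmcompressor's `oneshot` entrypoint. The sketch below is illustrative, not the exact script used to produce this repo: the import path for `oneshot`, the `QuantizationModifier` arguments, and the need for extra loading options may vary between llmcompressor versions, and the abbreviated `ignore` list must be replaced with the full list from the recipe above.

```python
# Illustrative sketch: block-FP8 weights + dynamic per-group FP8 activations,
# applied data-free with llmcompressor (no calibration dataset needed).
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zerofata/GLM-4.5-Iceblink-v2-106B-A12B"
SAVE_DIR = "GLM-4.5-Iceblink-v2-106B-A12B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same scheme as the recipe above: FP8 weights quantized in 32x32 blocks,
# FP8 activations quantized dynamically per group of 128.
scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        dynamic=False,
        symmetric=True,
        strategy=QuantizationStrategy.BLOCK,
        block_structure=[32, 32],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
)

recipe = QuantizationModifier(
    config_groups={"group_0": scheme},
    # Abbreviated for readability: use the full ignore list from the recipe above.
    ignore=["lm_head", "model.embed_tokens", "re:.*self_attn.*", "re:.*shared_experts.*"],
)

# No dataset argument: weight scales and dynamic activation scales are data-free.
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format, loadable by vLLM.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Using `config_groups` with an explicit `QuantizationScheme` mirrors the recipe above; llmcompressor also ships preset scheme strings, but the explicit form keeps the block and group sizes visible.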
### References

1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
   Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
   https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
   Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
   https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
   Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
   https://arxiv.org/pdf/2310.02410