GLM-4.6-GGUF-3.2263bpw

This is a 3.2 BPW quantized model for the GPU poors: those of us with 128 GiB of system RAM and 24 GiB of VRAM.

The quant aims to achieve best-in-class performance by relying on:

  • SOTA IQK-quants by @ikawrakow
  • GGUF Tool Suite with the amazing calibration data by @Thireus
  • Well-balanced importance matrix by @mradermacher
  • Top-notch knowledge sharing by @ubergarm, @bartowski, @eaddario, @AesSedai, and many others

Size

The FFN tensors take about 120 GiB and are loaded into system RAM, leaving absolutely no space for anything else: no GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.

The other tensors take about 12 GiB and are loaded into VRAM, leaving some space for the context and compute buffers.
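With this split, the per-expert FFN tensors need to stay in system RAM while everything else goes to the GPU. A sketch of a launch command, assuming llama.cpp/ik_llama.cpp-style flags and a hypothetical filename; adjust paths and values for your setup:

```shell
# Illustrative only -- flag spellings follow llama.cpp/ik_llama.cpp conventions:
#   -ngl 99         offload all repeating layers to the GPU
#   -ot "exps=CPU"  override-tensor rule keeping per-expert FFN tensors in RAM
#   -c 32768        context size, limited by the VRAM left after the weights
./llama-server -m GLM-4.6-3.2263bpw.gguf -ngl 99 -ot "exps=CPU" -c 32768
```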

Size from llama-server output:

llm_load_print_meta: model size       = 134.269 GiB (3.233 BPW)
llm_load_print_meta: repeating layers = 133.195 GiB (3.221 BPW, 355.234 B parameters)
...
llm_load_tensors:        CPU buffer size = 123220.78 MiB
llm_load_tensors:  CUDA_Host buffer size =   486.20 MiB
llm_load_tensors:      CUDA0 buffer size = 11829.74 MiB
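The reported BPW is simply total bits divided by total parameters, with 1 GiB = 2^30 bytes. A quick sanity check that the numbers above are consistent:

```python
# Back out the implied parameter count from the reported size and BPW;
# it should land near the model's ~357B parameters.
size_gib = 134.269
bpw = 3.233
total_bits = size_gib * 8 * 2**30   # 1 GiB = 2**30 bytes, 8 bits per byte
params = total_bits / bpw
print(params / 1e9)  # billions of parameters, should be close to 357
```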

System RAM usage from top output:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
19132 sayap     20   0  185.0g 122.0g   1.4g R  99.4  97.7   1:02.77 llama-server

Quality

Recipe with a mixture of IQ4_KSS, IQ3_KS, and IQ2_KL for the FFN tensors, with no harmonization:
## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/

## Model head & embeddings — qbits: 32 6 5 
^output_norm\.weight$=f32
^token_embd\.weight$=iq5_ks
^output\.weight$=iq6_k

## Multi-headed attention parameters — qbits: 32 8 5 
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight$=iq5_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight$=iq5_ks

## Dense Feed-Forward Network weights — qbits: 8 
^blk\.[0-2]\.ffn_gate\.weight$=q8_0
^blk\.[0-2]\.ffn_down\.weight$=q8_0
^blk\.[0-2]\.ffn_up\.weight$=q8_0

## NextN tensors — qbits: 32 5 
^blk\.92\.nextn\.enorm\.weight$=f32
^blk\.92\.nextn\.eh_proj\.weight$=iq5_ks
^blk\.92\.nextn\.shared_head_norm\.weight$=f32
^blk\.92\.nextn\.hnorm\.weight$=f32

## MoE Gating & Routing — qbits: 32 
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_inp\.weight$=f32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.exp_probs_b\.bias$=f32

## Misc / Other tensors — qbits: 32 
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.post_attention_norm\.weight$=f32

## GPU-loaded - MoE Shared Experts Feed-Forward Network - ffn_*_shexp
# ffn_down_shexp — down-projection (shared experts) — qbits: 8 
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_down_shexp\.weight$=q8_0

# ffn_up_shexp — up-projection (shared experts) — qbits: 8 
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_up_shexp\.weight$=q8_0

# ffn_gate_shexp — gating network (shared experts) — qbits: 8 
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_shexp\.weight$=q8_0

## CPU-friendly - MoE Per-expert Feed-Forward Network - ffn_*_exps
# ffn_down_exps — down-projection (per-expert) — qbits: 4 3 2 
^blk\.(4[4-9]|5[0-9]|92)\.ffn_down_exps\.weight$=iq4_kss
^blk\.(18|2[4-5]|27|29|30|32|37|39|4[0-3]|6[0-9]|7[0-5])\.ffn_down_exps\.weight$=iq3_ks
^blk\.([3-9]|1[0-7]|19|2[0-3]|26|28|31|3[3-6]|38|7[6-9]|8[0-9]|9[0-1])\.ffn_down_exps\.weight$=iq2_kl

# ffn_up_exps — up-projection (per-expert) — qbits: 4 3 2 
^blk\.(47|49|5[2-6]|5[8-9]|6[2-5]|6[7-9]|70|73|92)\.ffn_up_exps\.weight$=iq4_kss
^blk\.(15|24|2[7-8]|30|34|3[6-9]|4[0-6]|48|[5-6][0-1]|57|66|7[1-2]|7[4-9]|9[0-1])\.ffn_up_exps\.weight$=iq3_ks
^blk\.([3-9]|1[0-4]|1[6-9]|2[0-3]|2[5-6]|29|3[1-3]|35|8[0-9])\.ffn_up_exps\.weight$=iq2_kl

# ffn_gate_exps — gating network (per-expert) — qbits: 4 3 2 
^blk\.(55|65|6[8-9]|9[1-2])\.ffn_gate_exps\.weight$=iq4_kss
^blk\.(25|2[7-8]|30|4[2-9]|[5-6][0-4]|5[6-9]|6[6-7]|7[0-9]|90)\.ffn_gate_exps\.weight$=iq3_ks
^blk\.([3-9]|1[0-9]|2[0-4]|26|29|3[1-9]|4[0-1]|8[0-9])\.ffn_gate_exps\.weight$=iq2_kl

## Summary of tensor sizes per class
# GPU Total: 12.13 GiB (100.0%) | 12.13 GiB max, if all were q8_0 | 12.13 GiB min, if all were q8_0
# CPU Total: 121.87 GiB (77.0%) | 158.24 GiB max, if all were iq4_kss | 106.90 GiB min, if all were iq2_kl
# GPU+CPU Total: 134.01 GiB (88.5%)

## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE		Count	BPW	Assigned GiB	% Assigned	Max GiB (all)
# +f32       	835	32    	  0.28 GiB	-		-
# q8_0      	465	8.5   	  3.63 GiB	100.0%		3.63
# +iq6_k     	1  	6.625 	  0.60 GiB	-		-
# +iq5_ks    	187	5.25  	  7.63 GiB	-		-
#
# CPU-friendly quants:
# QTYPE		Count	BPW	Assigned GiB	% Assigned	Max GiB (all)
# +iq5_ks    	1  	5.25  	  0.03 GiB	-		-
# +iq4_kss   	3  	4     	  1.76 GiB	-		-
# iq4_kss   	39 	4     	 22.85 GiB	14.6%		156.45
# iq3_ks    	102	3.1875	 47.63 GiB	38.2%		124.67
# iq2_kl    	126	2.6875	 49.60 GiB	47.2%		105.11
#
# -Average BPW: 3.2263
#
# -Notes:
# - '+' means user-defined pre-assigned tensors, or tensor missing from csv data or f32 tensors
# - Recipe produced on the 2025-11-27 04:19:33 WIB+0700 using Thireus' GGUF tools (https://gguf.thireus.com/)
# - Script SHA-256: 569b7f6a3239c9173d71ca1fadf34222607d72a2cfed2c284b42633e95b4a627
# - Calibration dataset 'models/GLM-4.6/kld_results.csv' SHA-256: c9eba4144dcd4c837fa68f3ad0620d43b9caafeec5928cfb3198034622b9166e
# - tensors.bf16.map SHA-256: fa987db60ed8e9eb4348a45cb0ad630f81a97b20292ed9adfe0369ddd3ec2828
# - tensors.bf16.map model name: GLM-4.6-THIREUS-BF16-SPECIAL_TENSOR-01760-of-01760
# - tensors.q8_0.map SHA-256: 4cdc6924a3e79e3a8df15cc40d607d01ad2d29375eb411ca423a44d437ae0c66
# - tensors.q8_0.map model name: GLM-4.6-THIREUS-Q8_0-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq6_k.map SHA-256: 0743f08065ebeeac64c52844c0a1cbde4e4e9242230d4218de7647fd46d2ba99
# - tensors.iq6_k.map model name: GLM-4.6-THIREUS-IQ6_K-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq5_ks.map SHA-256: 71bc70300215e17e1956fffd56bb0b63baaa6c09839333fa1ad59d6a47dd9455
# - tensors.iq5_ks.map model name: GLM-4.6-THIREUS-IQ5_KS-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq4_kss.map SHA-256: 78d8ec2bd1bd29d56add18b1685add51d29d4f953b2b6fb27f541e33009c73f4
# - tensors.iq4_kss.map model name: GLM-4.6-THIREUS-IQ4_KSS-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq3_ks.map SHA-256: 19804948e1f1bf32def27f7f4879a25a1bb22b2061720831c0841f940217bd01
# - tensors.iq3_ks.map model name: GLM-4.6-THIREUS-IQ3_KS-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq2_kl.map SHA-256: a13547ff8e7dcbc5ea25bdd506a8b224c8eef625d8a67f45b27b9e28ed3f7ea0
# - tensors.iq2_kl.map model name: GLM-4.6-THIREUS-IQ2_KL-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq1_kt.map SHA-256: 793fce90c8b4c735e406f9ee701fc10d9e483d90fbdafdab533096ac7d9e748e
# - tensors.iq1_kt.map model name: GLM-4.6-THIREUS-IQ1_KT-SPECIAL_TENSOR-01760-of-01760
# - GPG signatures: DISABLED
# - Command used:
# ./quant_assign.py models/GLM-4.6/kld_results.csv --tolerance 0.001 --cpu-tensors-max-size 121.8 \
# --gpu-tensors-max-size 12.2 --exponential-factor 1.0 --skip-gpg --cpu-tensors \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_down_exps\.weight' 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_up_exps\.weight' \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_gate_exps\.weight' --gpu-tensors '.*' --cpu-quants iq4_kss iq3_ks iq2_kl \
# --gpu-quants q8_0 --cpu-assign-tensors '^blk\.(92)\.ffn_down_exps\.weight=iq4_kss' \
# '^blk\.(92)\.ffn_up_exps\.weight=iq4_kss' '^blk\.(92)\.ffn_gate_exps\.weight=iq4_kss' \
# '^blk\.92\.nextn\.shared_head_head\.weight=iq5_ks' '^blk\.92\.nextn\.embed_tokens\.weight=iq5_ks' \
# '^blk\.92\.nextn\.eh_proj\.weight=iq5_ks' --gpu-assign-qtype iq5_ks --gpu-assign-tensors \
# '^blk\..*\.attn_(q|output)\.weight=iq5_ks' '^token_embd\.weight=iq5_ks' '^output\.weight=iq6_k' --harmonize-tensors \
# '' --harmonization-technique 0

## THE END!
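The layer ranges in the recipe are plain regex alternations over decimal indices. A quick check that, for example, the attention pattern covers exactly layers 0 through 92 and nothing else:

```python
import re

# The recipe's layer-range alternations enumerate indices 0..92:
# ([0-9]|[1-8][0-9]|9[0-2]) matches "0"-"9", "10"-"89", and "90"-"92".
pattern = re.compile(r"^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight$")

matched = [i for i in range(200) if pattern.match(f"blk.{i}.attn_q.weight")]
print(matched[0], matched[-1], len(matched))  # layers 0 through 92, 93 total
```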

PPL result with wiki.test.raw:

Final estimate: PPL over 565 chunks for n_ctx=512 = 3.7074 +/- 0.02174

For comparison, see the graphs at https://huggingface.co/ubergarm/GLM-4.6-GGUF.
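The "Final estimate" above is the exponential of the mean negative log-likelihood over the evaluated tokens. A minimal sketch of that computation:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp of the mean negative log-likelihood over tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that always assigns probability 1/4 to the observed token
# has perplexity of about 4.
print(perplexity([math.log(0.25)] * 8))
```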

KLD result with ddh0_imat_calibration_data_v2.txt and GLM-4.6-KLD-ref-logits-Q8_0-ddh0-imat-calibration-data-v2.bin:

====== Perplexity statistics ======
Mean PPL(Q)                   :   8.718119 ±   0.155248
Mean PPL(base)                :   8.453608 ±   0.150165
Cor(ln(PPL(Q)), ln(PPL(base))):  98.69%
Mean ln(PPL(Q)/PPL(base))     :   0.030810 ±   0.002874
Mean PPL(Q)/PPL(base)         :   1.031290 ±   0.002964
Mean PPL(Q)-PPL(base)         :   0.264511 ±   0.025186

====== KL divergence statistics ======
Mean    KLD:   0.067958 ±   0.001349
Maximum KLD:   6.633387
99.9%   KLD:   2.941191
99.0%   KLD:   0.885563
95.0%   KLD:   0.257275
90.0%   KLD:   0.138484
Median  KLD:   0.020575
10.0%   KLD:   0.000084
 5.0%   KLD:   0.000021
 1.0%   KLD:   0.000002
 0.1%   KLD:  -0.000001
Minimum KLD:  -0.000014

====== Token probability statistics ======
Mean    Δp: -0.786 ± 0.051 %
Maximum Δp: 95.627%
99.9%   Δp: 54.617%
99.0%   Δp: 20.354%
95.0%   Δp:  6.130%
90.0%   Δp:  2.493%
75.0%   Δp:  0.180%
Median  Δp: -0.006%
25.0%   Δp: -0.639%
10.0%   Δp: -4.813%
 5.0%   Δp: -10.051%
 1.0%   Δp: -33.562%
 0.1%   Δp: -74.282%
Minimum Δp: -98.674%
RMS Δp    :  8.061 ± 0.152 %
Same top p: 86.423 ± 0.217 %

For comparison, see the graph at https://www.reddit.com/r/LocalLLaMA/comments/1nwimej/comment/nhg9jnn/.
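The "Mean KLD" above averages, over all evaluated token positions, the KL divergence between the Q8_0 reference's next-token distribution and the quant's. A minimal sketch for a single position, with made-up probabilities:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base  = [0.70, 0.20, 0.10]   # reference (Q8_0) next-token distribution
quant = [0.60, 0.25, 0.15]   # quantized-model distribution
print(kl_divergence(base, quant))  # small positive number; 0 iff identical
```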

Base model: zai-org/GLM-4.6