# GLM-4.6-GGUF-3.2263bpw
This is a 3.2 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.
The quant aims to achieve best-in-class performance by relying on:
- SOTA IQK-quants by @ikawrakow
- GGUF Tool Suite with the amazing calibration data by @Thireus
- Well-balanced importance matrix by @mradermacher
- Top-notch knowledge sharing by @ubergarm, @bartowski, @eaddario, @AesSedai, and many others
## Size
The FFN tensors will take about 120 GiB, to be loaded into System RAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.

The other tensors will take about 12 GiB, to be loaded into VRAM, leaving some space for the context and compute buffers.
Size from `llama-server` output:

```
llm_load_print_meta: model size = 134.269 GiB (3.233 BPW)
llm_load_print_meta: repeating layers = 133.195 GiB (3.221 BPW, 355.234 B parameters)
...
llm_load_tensors: CPU buffer size = 123220.78 MiB
llm_load_tensors: CUDA_Host buffer size = 486.20 MiB
llm_load_tensors: CUDA0 buffer size = 11829.74 MiB
```
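As a quick sanity check, the reported bits-per-weight and parameter count reproduce the repeating-layer size from the log above:

```python
# Verify: parameters * BPW / 8 bits-per-byte should match the
# "repeating layers" size that llama-server reports.
params = 355.234e9  # repeating-layer parameters from the log
bpw = 3.221         # bits per weight from the log
gib = params * bpw / 8 / 2**30
print(f"{gib:.1f} GiB")  # prints "133.2 GiB", matching the reported 133.195 GiB
```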
System RAM usage from `top` output:

```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19132 sayap 20 0 185.0g 122.0g 1.4g R 99.4 97.7 1:02.77 llama-server
```
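A RAM/VRAM split like this is typically achieved with tensor overrides at load time. The command below is only a sketch (the model path and context size are placeholders, not taken from this card): offload all layers to the GPU, then force the large per-expert FFN tensors back into System RAM.

```shell
# Illustrative only -- model path and context size are placeholders.
# -ngl 99 offloads all layers to the GPU; -ot (--override-tensor)
# keeps the per-expert FFN tensors (ffn_*_exps) in System RAM.
./llama-server \
  -m /path/to/GLM-4.6-GGUF-3.2263bpw.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU"
```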
## Quality
Recipe with a mixture of IQ4_KSS, IQ3_KS, and IQ2_KL for the FFN experts, no harmonization:

```
## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/
## Model head & embeddings — qbits: 32 6 5
^output_norm\.weight$=f32
^token_embd\.weight$=iq5_ks
^output\.weight$=iq6_k
## Multi-headed attention parameters — qbits: 32 8 5
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight$=iq5_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight$=iq5_ks
## Dense Feed-Forward Network weights — qbits: 8
^blk\.[0-2]\.ffn_gate\.weight$=q8_0
^blk\.[0-2]\.ffn_down\.weight$=q8_0
^blk\.[0-2]\.ffn_up\.weight$=q8_0
## NextN tensors — qbits: 32 5
^blk\.92\.nextn\.enorm\.weight$=f32
^blk\.92\.nextn\.eh_proj\.weight$=iq5_ks
^blk\.92\.nextn\.shared_head_norm\.weight$=f32
^blk\.92\.nextn\.hnorm\.weight$=f32
## MoE Gating & Routing — qbits: 32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_inp\.weight$=f32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.exp_probs_b\.bias$=f32
## Misc / Other tensors — qbits: 32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.post_attention_norm\.weight$=f32
## GPU-loaded - MoE Shared Experts Feed-Forward Network - ffn_*_shexp
# ffn_down_shexp — down-projection (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_down_shexp\.weight$=q8_0
# ffn_up_shexp — up-projection (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_up_shexp\.weight$=q8_0
# ffn_gate_shexp — gating network (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_shexp\.weight$=q8_0
## CPU-friendly - MoE Per-expert Feed-Forward Network - ffn_*_exps
# ffn_down_exps — down-projection (per-expert) — qbits: 4 3 2
^blk\.(4[4-9]|5[0-9]|92)\.ffn_down_exps\.weight$=iq4_kss
^blk\.(18|2[4-5]|27|29|30|32|37|39|4[0-3]|6[0-9]|7[0-5])\.ffn_down_exps\.weight$=iq3_ks
^blk\.([3-9]|1[0-7]|19|2[0-3]|26|28|31|3[3-6]|38|7[6-9]|8[0-9]|9[0-1])\.ffn_down_exps\.weight$=iq2_kl
# ffn_up_exps — up-projection (per-expert) — qbits: 4 3 2
^blk\.(47|49|5[2-6]|5[8-9]|6[2-5]|6[7-9]|70|73|92)\.ffn_up_exps\.weight$=iq4_kss
^blk\.(15|24|2[7-8]|30|34|3[6-9]|4[0-6]|48|[5-6][0-1]|57|66|7[1-2]|7[4-9]|9[0-1])\.ffn_up_exps\.weight$=iq3_ks
^blk\.([3-9]|1[0-4]|1[6-9]|2[0-3]|2[5-6]|29|3[1-3]|35|8[0-9])\.ffn_up_exps\.weight$=iq2_kl
# ffn_gate_exps — gating network (per-expert) — qbits: 4 3 2
^blk\.(55|65|6[8-9]|9[1-2])\.ffn_gate_exps\.weight$=iq4_kss
^blk\.(25|2[7-8]|30|4[2-9]|[5-6][0-4]|5[6-9]|6[6-7]|7[0-9]|90)\.ffn_gate_exps\.weight$=iq3_ks
^blk\.([3-9]|1[0-9]|2[0-4]|26|29|3[1-9]|4[0-1]|8[0-9])\.ffn_gate_exps\.weight$=iq2_kl
## Summary of tensor sizes per class
# GPU Total: 12.13 GiB (100.0%) | 12.13 GiB max, if all were q8_0 | 12.13 GiB min, if all were q8_0
# CPU Total: 121.87 GiB (77.0%) | 158.24 GiB max, if all were iq4_kss | 106.90 GiB min, if all were iq2_kl
# GPU+CPU Total: 134.01 GiB (88.5%)
## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +f32 835 32 0.28 GiB - -
# q8_0 465 8.5 3.63 GiB 100.0% 3.63
# +iq6_k 1 6.625 0.60 GiB - -
# +iq5_ks 187 5.25 7.63 GiB - -
#
# CPU-friendly quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +iq5_ks 1 5.25 0.03 GiB - -
# +iq4_kss 3 4 1.76 GiB - -
# iq4_kss 39 4 22.85 GiB 14.6% 156.45
# iq3_ks 102 3.1875 47.63 GiB 38.2% 124.67
# iq2_kl 126 2.6875 49.60 GiB 47.2% 105.11
#
# -Average BPW: 3.2263
#
# -Notes:
# - '+' means user-defined pre-assigned tensors, or tensor missing from csv data or f32 tensors
# - Recipe produced on the 2025-11-27 04:19:33 WIB+0700 using Thireus' GGUF tools (https://gguf.thireus.com/)
# - Script SHA-256: 569b7f6a3239c9173d71ca1fadf34222607d72a2cfed2c284b42633e95b4a627
# - Calibration dataset 'models/GLM-4.6/kld_results.csv' SHA-256: c9eba4144dcd4c837fa68f3ad0620d43b9caafeec5928cfb3198034622b9166e
# - tensors.bf16.map SHA-256: fa987db60ed8e9eb4348a45cb0ad630f81a97b20292ed9adfe0369ddd3ec2828
# - tensors.bf16.map model name: GLM-4.6-THIREUS-BF16-SPECIAL_TENSOR-01760-of-01760
# - tensors.q8_0.map SHA-256: 4cdc6924a3e79e3a8df15cc40d607d01ad2d29375eb411ca423a44d437ae0c66
# - tensors.q8_0.map model name: GLM-4.6-THIREUS-Q8_0-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq6_k.map SHA-256: 0743f08065ebeeac64c52844c0a1cbde4e4e9242230d4218de7647fd46d2ba99
# - tensors.iq6_k.map model name: GLM-4.6-THIREUS-IQ6_K-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq5_ks.map SHA-256: 71bc70300215e17e1956fffd56bb0b63baaa6c09839333fa1ad59d6a47dd9455
# - tensors.iq5_ks.map model name: GLM-4.6-THIREUS-IQ5_KS-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq4_kss.map SHA-256: 78d8ec2bd1bd29d56add18b1685add51d29d4f953b2b6fb27f541e33009c73f4
# - tensors.iq4_kss.map model name: GLM-4.6-THIREUS-IQ4_KSS-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq3_ks.map SHA-256: 19804948e1f1bf32def27f7f4879a25a1bb22b2061720831c0841f940217bd01
# - tensors.iq3_ks.map model name: GLM-4.6-THIREUS-IQ3_KS-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq2_kl.map SHA-256: a13547ff8e7dcbc5ea25bdd506a8b224c8eef625d8a67f45b27b9e28ed3f7ea0
# - tensors.iq2_kl.map model name: GLM-4.6-THIREUS-IQ2_KL-SPECIAL_TENSOR-01760-of-01760
# - tensors.iq1_kt.map SHA-256: 793fce90c8b4c735e406f9ee701fc10d9e483d90fbdafdab533096ac7d9e748e
# - tensors.iq1_kt.map model name: GLM-4.6-THIREUS-IQ1_KT-SPECIAL_TENSOR-01760-of-01760
# - GPG signatures: DISABLED
# - Command used:
# ./quant_assign.py models/GLM-4.6/kld_results.csv --tolerance 0.001 --cpu-tensors-max-size 121.8 \
# --gpu-tensors-max-size 12.2 --exponential-factor 1.0 --skip-gpg --cpu-tensors \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_down_exps\.weight' 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_up_exps\.weight' \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_gate_exps\.weight' --gpu-tensors '.*' --cpu-quants iq4_kss iq3_ks iq2_kl \
# --gpu-quants q8_0 --cpu-assign-tensors '^blk\.(92)\.ffn_down_exps\.weight=iq4_kss' \
# '^blk\.(92)\.ffn_up_exps\.weight=iq4_kss' '^blk\.(92)\.ffn_gate_exps\.weight=iq4_kss' \
# '^blk\.92\.nextn\.shared_head_head\.weight=iq5_ks' '^blk\.92\.nextn\.embed_tokens\.weight=iq5_ks' \
# '^blk\.92\.nextn\.eh_proj\.weight=iq5_ks' --gpu-assign-qtype iq5_ks --gpu-assign-tensors \
# '^blk\..*\.attn_(q|output)\.weight=iq5_ks' '^token_embd\.weight=iq5_ks' '^output\.weight=iq6_k' --harmonize-tensors \
# '' --harmonization-technique 0
## THE END!
```
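The recipe selects tensors with regular expressions over block indices. For example, the alternation `([0-9]|[1-8][0-9]|9[0-2])` used throughout covers every block from 0 to 92, which a short check confirms:

```python
import re

# One of the recipe's patterns: match attn_v weights for blocks 0..92.
pattern = re.compile(r"^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight$")
matched = [i for i in range(200) if pattern.match(f"blk.{i}.attn_v.weight")]
print(f"{matched[0]}..{matched[-1]} ({len(matched)} blocks)")  # 0..92 (93 blocks)
```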
PPL result with wiki.test.raw:

```
Final estimate: PPL over 565 chunks for n_ctx=512 = 3.7074 +/- 0.02174
```
You can compare against the graphs at https://huggingface.co/ubergarm/GLM-4.6-GGUF.
KLD result with `ddh0_imat_calibration_data_v2.txt` and `GLM-4.6-KLD-ref-logits-Q8_0-ddh0-imat-calibration-data-v2.bin`:

```
====== Perplexity statistics ======
Mean PPL(Q) : 8.718119 ± 0.155248
Mean PPL(base) : 8.453608 ± 0.150165
Cor(ln(PPL(Q)), ln(PPL(base))): 98.69%
Mean ln(PPL(Q)/PPL(base)) : 0.030810 ± 0.002874
Mean PPL(Q)/PPL(base) : 1.031290 ± 0.002964
Mean PPL(Q)-PPL(base) : 0.264511 ± 0.025186
====== KL divergence statistics ======
Mean KLD: 0.067958 ± 0.001349
Maximum KLD: 6.633387
99.9% KLD: 2.941191
99.0% KLD: 0.885563
95.0% KLD: 0.257275
90.0% KLD: 0.138484
Median KLD: 0.020575
10.0% KLD: 0.000084
5.0% KLD: 0.000021
1.0% KLD: 0.000002
0.1% KLD: -0.000001
Minimum KLD: -0.000014
====== Token probability statistics ======
Mean Δp: -0.786 ± 0.051 %
Maximum Δp: 95.627%
99.9% Δp: 54.617%
99.0% Δp: 20.354%
95.0% Δp: 6.130%
90.0% Δp: 2.493%
75.0% Δp: 0.180%
Median Δp: -0.006%
25.0% Δp: -0.639%
10.0% Δp: -4.813%
5.0% Δp: -10.051%
1.0% Δp: -33.562%
0.1% Δp: -74.282%
Minimum Δp: -98.674%
RMS Δp : 8.061 ± 0.152 %
Same top p: 86.423 ± 0.217 %
```
You can compare against the graph at https://www.reddit.com/r/LocalLLaMA/comments/1nwimej/comment/nhg9jnn/.
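The perplexity statistics above are internally consistent: the mean log-ratio exponentiates to the reported mean ratio, and the two mean PPLs give roughly the same number.

```python
import math

# Reported values: Mean ln(PPL(Q)/PPL(base)) and the two mean PPLs.
ln_ratio = 0.030810
ppl_q, ppl_base = 8.718119, 8.453608

print(round(math.exp(ln_ratio), 5))   # 1.03129, the reported mean ratio
print(round(ppl_q / ppl_base, 5))     # 1.03129, same to five decimals
```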
Base model: zai-org/GLM-4.6