---
quantized_by: AesSedai
pipeline_tag: text-generation
base_model: zai-org/GLM-4.5
license: mit
base_model_relation: quantized
---

## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

*NOTE*: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which provides Windows builds for CUDA 12.9. Also check the [Windows builds by Thireus here](https://github.com/Thireus/ik_llama.cpp/releases), which have been built against CUDA 12.8.

See [Ubergarm's GLM-4.5 quants](https://huggingface.co/ubergarm/GLM-4.5-GGUF) for info on how to use these recipes or make your own quants.

## IQ2_KT: 109.269 GiB (2.619 BPW), Final estimate: PPL = 4.1170 +/- 0.02457
<details>
<summary>👈 Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=iq4_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq3_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq4_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=iq6_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

</details>
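For context on applying a recipe like the one above, here is a minimal sketch assuming ik_llama.cpp's `llama-quantize` with its `--custom-q` option, an existing BF16 GGUF conversion, and an imatrix file. All file names, shard counts, and the thread count below are placeholders, not files from this repo:

```bash
# Minimal sketch, not a turnkey script: paths, shard names, and thread count
# are placeholders to adapt to your own setup.

# Save the recipe above as recipe.txt, then drop comments/blank lines and
# join the remaining regex=type rules with commas for --custom-q.
custom=$(grep -vE '^[[:space:]]*(#|$)' recipe.txt | paste -sd, -)

./build/bin/llama-quantize \
    --imatrix /models/GLM-4.5-imatrix.dat \
    --custom-q "$custom" \
    /models/GLM-4.5-BF16-00001-of-00099.gguf \
    /models/GLM-4.5-IQ2_KT.gguf \
    IQ2_KT 24
```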
## IQ4_KSS: 176.499 GiB (4.231 BPW), Final estimate: PPL = 3.3031 +/- 0.01871
<details>
<summary>👈 Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq6_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

</details>
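The "Final estimate: PPL" figures in these headings come from a perplexity run. The exact corpus and settings are not documented in this card, but a minimal sketch of such a measurement with ik_llama.cpp's `llama-perplexity` (model path, test file, context size, and thread count are placeholder assumptions) looks like:

```bash
# Minimal sketch; the corpus and context size behind the numbers above are not
# stated in this card, so treat these values as placeholders.
./build/bin/llama-perplexity \
    -m /models/GLM-4.5-IQ4_KSS-00001-of-00004.gguf \
    -f wiki.test.raw \
    -fa \
    --ctx-size 512 \
    --threads 24
```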
## IQ4_KS-IQ4_KS-IQ5_KS: 200.326 GiB (4.802 BPW), Final estimate: PPL = TBD (but better than IQ5_K)
<details>
<summary>👈 Recipe</summary>

```bash
# Default quant level @ Q8_0

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_up_exps\.weight=iq4_ks
blk\..*\.ffn_gate_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq5_ks
```

</details>
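Reading this recipe: only the shared-expert and routed-expert tensors are listed, so "Default quant level @ Q8_0" means everything else is left to fall back to Q8_0. A minimal sketch of expressing that with `llama-quantize`, assuming (as in the earlier sketch) that tensors unmatched by `--custom-q` fall back to the base type passed on the command line, with placeholder file names:

```bash
# Minimal sketch: Q8_0 is the base type; --custom-q overrides only the
# expert tensors listed in the recipe. File names are placeholders.
custom=$(grep -vE '^[[:space:]]*(#|$)' recipe.txt | paste -sd, -)

./build/bin/llama-quantize \
    --imatrix /models/GLM-4.5-imatrix.dat \
    --custom-q "$custom" \
    /models/GLM-4.5-BF16-00001-of-00099.gguf \
    /models/GLM-4.5-IQ4_KS-IQ4_KS-IQ5_KS.gguf \
    Q8_0 24
```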
## IQ5_K: 204.948 GiB (4.913 BPW), Final estimate: PPL = 3.1992 +/- 0.01801
<details>
<summary>👈 Recipe</summary>

```bash
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq5_k
blk\..*\.attn_v.*=iq5_k
blk\..*\.attn_output.*=iq5_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_k
blk\..*\.nextn\.shared_head_head\.weight=iq5_k
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
```

</details>
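Finally, a minimal sketch of serving one of these quants with ik_llama.cpp's `llama-server` on a hybrid CPU+GPU box. The layer count, tensor-override pattern, context size, and thread count below are assumptions to tune for your hardware, not settings recommended by this card:

```bash
# Minimal sketch of a hybrid CPU+GPU launch; adjust everything to your hardware.
./build/bin/llama-server \
    --model /models/GLM-4.5-IQ5_K-00001-of-00005.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    --threads 24 \
    --host 127.0.0.1 --port 8080
```

`-fmoe` (fused MoE) and `-ot exps=CPU` (keep the routed-expert tensors in system RAM while attention and shared experts go to VRAM) are the knobs commonly used with ik_llama.cpp for MoE models of this size.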