FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high"
|
| 47 |
```
|
| 48 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
Comparison:
|
| 50 |
|
| 51 |
Quant | size | PPL | Comment
|
| 52 |
---------|---------|------|-----------
|
| 53 |
IQ4_XS | 5.3e9 | 14.8 | -
|
| 54 |
-
|
|
|
|
| 55 |
Q6_K | 8.3e9 | 14.7 | Q6_K with default embedding and output, unstable with greedy sampling, poor performance on eval prompts
|
| 56 |
Q6_K_H | 7.3e9 | 14.8 | Hybrid quant with Q6_K embedding Q6_K output, stable with greedy sampling, excellent performance on eval prompts
|
| 57 |
|
|
@@ -68,6 +84,18 @@ solution of a complex problem, compared against a "dumber" thinking model which
|
|
| 68 |
suggests the RL training for the underlying model might have rewarded efficient solutions more than inefficient ones (hypothesis, could also just
|
| 69 |
be coincidence it happens to be "smart" on a couple tricky problems in the eval test set).
|
| 70 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
This is one of the strongest general reasoning models I have experienced to date as of 7/21/2025 independent of size, compared against both QwQ,
|
| 72 |
R1 distills of Qwen 2.5 models, and Qwen 3. However testing with some code problems show it is **extremely weak** on code generation problems.
|
| 73 |
|
|
@@ -79,6 +107,7 @@ Math benchmarks for the model are given here: https://huggingface.co/spaces/stea
|
|
| 79 |
| Link | Type | Size/e9 B | Notes |
|
| 80 |
|------|------|-----------|-------|
|
| 81 |
| [GLM-Z1-9B-0414.Q4_K_H.gguf](https://huggingface.co/steampunque/GLM-Z1-9B-0414-Hybrid-GGUF/resolve/main/GLM-Z1-9B-0414.Q4_K_H.gguf) | Q4_K_H | 6.6e9 B | 1.7B smaller than Q6_K with much better performance |
|
|
|
|
| 82 |
| [GLM-Z1-9B-0414.Q6_K_H.gguf](https://huggingface.co/steampunque/GLM-Z1-9B-0414-Hybrid-GGUF/resolve/main/GLM-Z1-9B-0414.Q6_K_H.gguf) | Q6_K_H | 7.3e9 B | 1B smaller than Q6_K with much better performance |
|
| 83 |
|
| 84 |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:
|
|
|
|
```
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high"
```
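A sketch of how these flags might be applied with llama-quantize is shown below. The input/output file names and the trailing base quant type are placeholders, and `--layer-types-high` comes from the hybrid layer quant patch referenced at the end of this README, not mainline llama-quantize.
```
# Sketch only: apply the hybrid-quant FLAGS defined above.
# File names and the trailing base type are illustrative placeholders.
llama-quantize ${FLAGS} GLM-Z1-9B-0414.BF16.gguf GLM-Z1-9B-0414.Q4_K_H.gguf Q4_K_M
```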
Q4_P_H is also available. This quant pads the FFN dimension up to an even multiple of 256 so K quants can be used for the FFN tensors.
Both the Q4_K_H and Q6_K_H quants replace the specified layer quant with legacy quants for the FFN tensors, while Q4_P_H uses exactly the
specified K layer quants. Eliminating the legacy quants improves both size and performance, since all layers
use K quants.
```
LAYER_TYPES='[
[0 ,"Q6_K_M"],[1 ,"Q5_K_L"],[2 ,"Q5_K_M"],[3 ,"Q5_K_S"],[4 ,"Q4_K_L"],[5 ,"Q4_K_M"],[6 ,"Q4_K_S"],[7 ,"Q4_K_M"],
[8 ,"Q4_K_S"],[9 ,"Q4_K_S"],[10,"Q4_K_S"],[11,"Q4_K_S"],[12,"Q4_K_S"],[13,"Q4_K_S"],[14,"Q4_K_S"],[15,"Q4_K_S"],
[16,"Q4_K_M"],[17,"Q4_K_S"],[18,"Q4_K_M"],[19,"Q4_K_S"],[20,"Q4_K_M"],[21,"Q4_K_S"],[22,"Q4_K_M"],[23,"Q4_K_S"],
[24,"Q4_K_M"],[25,"Q4_K_M"],[26,"Q4_K_M"],[27,"Q4_K_M"],[28,"Q4_K_L"],[29,"Q4_K_L"],[30,"Q4_K_L"],[31,"Q5_K_M"],
[32,"Q5_K_M"],[33,"Q5_K_M"],[34,"Q5_K_L"],[35,"Q5_K_L"],[36,"Q5_K_L"],[37,"Q6_K_S"],[38,"Q6_K_M"],[39,"Q6_K_L"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high --tensor-pad [[13696,13824],[27392,27648,2]] --override-kv glm4.feed_forward_length=int:13824"
```
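The `--tensor-pad` targets follow from the K quant block size of 256: the model's FFN length of 13696 is not a multiple of 256 (13696 / 256 = 53.5), which is why the FFN tensors otherwise fall back to legacy quants, while 13824 = 54 x 256. Below is a minimal sketch of the round-up arithmetic, with `pad256` as an illustrative helper (not a llama.cpp tool); the second pad pair appears to cover the fused tensor holding two FFN-width halves, which is presumably why it carries the trailing 2.
```
# Illustrative only: round a tensor dimension up to the next multiple of 256.
pad256() { echo $(( (($1 + 255) / 256) * 256 )); }

pad256 13696                      # 13824 -> first --tensor-pad pair and the feed_forward_length override
echo $(( 2 * $(pad256 13696) ))   # 27648 -> second pair (two padded FFN-width halves)
```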
Comparison:

Quant | Size (bytes) | PPL | Comment
---------|---------|------|-----------
IQ4_XS | 5.3e9 | 14.8 | -
Q4_P_H | 6.3e9 | 14.8 | Hybrid quant with Q4_K embedding, Q6_K output, and padded FFN tensors for K quants
Q4_K_H | 6.6e9 | 14.9 | Hybrid quant with Q4_K embedding and Q6_K output
Q6_K | 8.3e9 | 14.7 | Q6_K with default embedding and output; unstable with greedy sampling, poor performance on eval prompts
Q6_K_H | 7.3e9 | 14.8 | Hybrid quant with Q6_K embedding and Q6_K output; stable with greedy sampling, excellent performance on eval prompts
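
PPL figures like these are commonly measured with llama.cpp's perplexity tool; a minimal sketch is below, assuming a generic test corpus (the corpus behind this table is not specified here, and the file name is a placeholder).
```
# Sketch: measure perplexity of one quant over a text corpus (corpus file is a placeholder).
llama-perplexity -m GLM-Z1-9B-0414.Q4_P_H.gguf -f test-corpus.txt -c 4096 -ngl 99
```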
suggests the RL training for the underlying model might have rewarded efficient solutions more than inefficient ones (a hypothesis; it could also just
be coincidence that it happens to be "smart" on a couple of tricky problems in the eval test set).

The Q4_P_H quant is an improved Q4 quant that uses K quants for the FFN tensors instead of the fallback legacy quants. It does
very well on the eval test set and is the smallest available high performance hybrid quant for the model.

The model can be speculated using Qwen3 0.6B as the draft if the inference engine supports dynamic vocab translation between
the draft and target models. Approximate performance using a downstream speculator with llama.cpp on a 4070 (12G VRAM), with layers and
context fully in GPU:

Q | QKV | ND | NKV | gen tps | Comment
------|-------|------|--------|------------|---------
Q4_P_H| F16 | 0 | 32k | 66 | No draft
Q4_P_H| F16 | 3 | 31k | 81 | Spec 3
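
Since Qwen3 0.6B and GLM-Z1 use different vocabs, this setup needs an engine with dynamic vocab translation as noted above; the sketch below only illustrates the general shape of a draft-model launch using standard llama.cpp speculative options, with placeholder file names.
```
# Sketch only: target + draft speculative decoding, llama.cpp style.
# A Qwen3 0.6B draft for GLM-Z1 requires vocab translation support beyond mainline llama.cpp.
llama-server -m GLM-Z1-9B-0414.Q4_P_H.gguf \
             -md Qwen3-0.6B.Q8_0.gguf \
             --draft-max 3 -ngl 99 -ngld 99 -c 31744
```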
This is one of the strongest general reasoning models I have experienced to date (as of 7/21/2025), independent of size, compared against QwQ,
R1 distills of Qwen 2.5 models, and Qwen 3. However, testing with some code problems shows it is **extremely weak** on code generation.

| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [GLM-Z1-9B-0414.Q4_K_H.gguf](https://huggingface.co/steampunque/GLM-Z1-9B-0414-Hybrid-GGUF/resolve/main/GLM-Z1-9B-0414.Q4_K_H.gguf) | Q4_K_H | 6.6e9 B | 1.7B smaller than Q6_K with much better performance |
| [GLM-Z1-9B-0414.Q4_P_H.gguf](https://huggingface.co/steampunque/GLM-Z1-9B-0414-Hybrid-GGUF/resolve/main/GLM-Z1-9B-0414.Q4_P_H.gguf) | Q4_P_H | 6.3e9 B | 2B smaller than Q6_K with much better performance |
| [GLM-Z1-9B-0414.Q6_K_H.gguf](https://huggingface.co/steampunque/GLM-Z1-9B-0414-Hybrid-GGUF/resolve/main/GLM-Z1-9B-0414.Q6_K_H.gguf) | Q6_K_H | 7.3e9 B | 1B smaller than Q6_K with much better performance |
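
One way to fetch and try a quant from this table (the Q4_P_H file is used as the example; the prompt and context size are arbitrary):
```
# Download one quant from the repo and run it fully offloaded with 32k context.
huggingface-cli download steampunque/GLM-Z1-9B-0414-Hybrid-GGUF \
    GLM-Z1-9B-0414.Q4_P_H.gguf --local-dir .
llama-cli -m GLM-Z1-9B-0414.Q4_P_H.gguf -ngl 99 -c 32768 -p "Prove that the square root of 2 is irrational."
```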
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: