Update README.md

TODO:
- Experiment with the new `-mla 3` (recent **ik_llama** patches enable a new MLA implementation on CUDA)
- Re-evaluate `-rtr` (in case Q8_0 can be repacked as Q8_0_R8 after some of the recent patches)

### Inference Performance vs VRAM Considerations

You can try the following to squeeze out more context on your system (see the example invocation after this list):
- Running with `-ctk q8_0` can save some VRAM, but is a little slower on the target system
- Reducing buffers can free up a bit more VRAM at a very minor cost to performance (`-amb 512` and `-b 1024 -ub 1024`)
- Switching to an IQ quant will save some memory at the cost of performance (*very very roughly*, a 10% memory saving for a 10% drop in inference performance)
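As a rough sketch (not a command taken from this repo), these options can be combined in a single **ik_llama** `llama-server` invocation; the model path, context size, and `-ngl` value below are placeholders for your own setup:

```bash
# Illustrative example only - combines the VRAM-saving options discussed above.
# -ctk q8_0        : quantized K cache, saves VRAM at a small speed cost
# -amb 512         : smaller attention work buffer
# -b 1024 -ub 1024 : smaller batch / micro-batch buffers
# The model path, -c (context) and -ngl (GPU layers) are placeholders.
./build/bin/llama-server \
  -m /path/to/model.gguf \
  -c 32768 -ngl 99 \
  -ctk q8_0 -amb 512 -b 1024 -ub 1024
```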

## Optimizing For Coding

Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but they often produce code that fails to run or contains bugs. Based on empirical observations, coding seems to be strongly affected by model quantization, so we use larger quantization where it matters to reduce perplexity while remaining within the target system constraints of 24GB-32GB VRAM and 512GB RAM.

### Quantization Approach

When running with the **Flash MLA** optimization enabled, **ik_llama** unpacks the **attention** layers into `Q8_0`, so we match that in our model (similar to ubergarm's ik_llama.cpp quants). We also keep all the other small layers as `Q8_0`, while leaving any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (an idea we borrow from the `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`.
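As a minimal sketch of how such a per-tensor recipe could be expressed, assuming an **ik_llama** `llama-quantize` build that accepts `--custom-q` regex=type overrides (the regexes, file names, and fallback type here are illustrative, not the exact recipe used for this model):

```bash
# Sketch only - assumes llama-quantize supports --custom-q "regex=type,..." overrides.
# Attention tensors stay at Q8_0, ffn_down_exps gets Q6_K_R4, and the remaining
# large MoE tensors get Q4_K_R4; file names and the fallback type are placeholders.
./build/bin/llama-quantize \
  --custom-q "attn_=q8_0,ffn_down_exps=q6_k_r4,ffn_up_exps=q4_k_r4,ffn_gate_exps=q4_k_r4" \
  /path/to/model-F16.gguf \
  /path/to/model-Q4_K_R4.gguf \
  Q4_K_R4
```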

Quantization Summary:

The **attn_kv_b** layers are included in the original model, but they contain the same information as the **attn_k_b** and **attn_v_b** layers. Some quants, like `unsloth`, remove the **attn_kv_b** layers altogether. We keep these layers for completeness, but push them out of VRAM with `attn_kv_b=CPU` when running the model.
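For example, assuming a build that supports tensor overrides (the `-ot`/`--override-tensor` option; the flag name is an assumption here, while the `attn_kv_b=CPU` pattern comes from this README), the run command might include:

```bash
# Sketch only - assumes -ot/--override-tensor "pattern=backend" is available.
# Keeps the attn_kv_b tensors in system RAM instead of VRAM; the model path is a placeholder.
./build/bin/llama-server \
  -m /path/to/model.gguf \
  -ngl 99 \
  -ot "attn_kv_b=CPU"
```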

### No imatrix

Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to negatively affect tasks like coding. In other words, while an imatrix can improve specific benchmarks that are similar to the imatrix input sample, it also skews model performance towards tasks similar to that sample at the expense of other tasks.

## Benchmarks

**Benchmark System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB

The following quants were tested:
- **Q2_K_R4** (attention - `Q8_0`, all MoE - `Q2_K_R4`)