Update README.md
README.md
@@ -74,7 +74,7 @@ TODO:
- Re-evaluate `-rtr` (in case Q8_0 can be repacked as Q8_0_R8 after some of the recent patches)

## Quantization Approach

-When running with the **Flash MLA** optimization enabled, **ik_llama** will unpack the **attention** layers into `Q8_0`, so we match that in our model. We also keep all the other small layers as `Q8_0` and leave any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (an idea we borrow from the `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`.
+When running with the **Flash MLA** optimization enabled, **ik_llama** will unpack the **attention** layers into `Q8_0`, so we match that in our model (similar to ubergarm's ik_llama.cpp quants). We also keep all the other small layers as `Q8_0` and leave any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (an idea we borrow from the `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`.

Quantization Summary:
- Keep all the small `F32` layers untouched
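As a minimal sketch of the recipe described above, the per-tensor choices could be passed to `llama-quantize` roughly like this, assuming an ik_llama.cpp build that supports the `--custom-q` per-tensor override option; the file names, imatrix path, and thread count are placeholders:

```bash
# Sketch only: regex=type pairs send the large MoE tensors to their own quant types,
# while the trailing q8_0 base type keeps attention and the other small tensors at Q8_0.
# 1-D F32 tensors (norms, biases) are not quantized and stay as they are.
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "ffn_down_exps=q6_k_r4,ffn_up_exps=q4_k_r4,ffn_gate_exps=q4_k_r4" \
    model-bf16.gguf model-quantized.gguf q8_0 32
```

Here each pattern is matched as a regex against the tensor names, so `ffn_down_exps` is intended to cover every `blk.*.ffn_down_exps.weight` tensor.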