anikifoss committed (verified)
Commit 94b30a7 · Parent(s): 93d47bc

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -74,7 +74,7 @@ TODO:
 - Re-evaluate `-rtr` (in case Q8_0 can be repacked as Q8_0_R8 after some of the recent patches)
 
 ## Quantization Approach
-When running with **Flash MLA** optimization enabled, **ik_llama** will unpack **attention** layers into `Q8_0`, so we match that in our model. We also keep all the other small layers as `Q8_0` while also leaving any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (we borrow this idea from `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (ffn_up_exps, ffn_gate_exps) are quantized as `Q4_K_R4`
+When running with **Flash MLA** optimization enabled, **ik_llama** will unpack **attention** layers into `Q8_0`, so we match that in our model (similar to ubergarm's ik_llama.cpp quants). We also keep all the other small layers as `Q8_0`, while leaving any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (we borrow this idea from `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (ffn_up_exps, ffn_gate_exps) are quantized as `Q4_K_R4`.
 
 Quantization Summary:
 - Keep all the small `F32` layers untouched
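
For reference, a per-tensor mix like the one described in the changed paragraph (attention and other small tensors at `Q8_0`, `ffn_down_exps` at `Q6_K_R4`, the remaining MoE tensors at `Q4_K_R4`, `F32` tensors untouched) is typically expressed as tensor-name overrides at quantization time. The snippet below is a minimal sketch only: it assumes an ik_llama.cpp build of `llama-quantize` with a `--custom-q` regex=type override, and the flag syntax, tensor-name patterns, and file names are assumptions rather than the exact recipe used for this model; check `llama-quantize --help` in your build.

```bash
# Hypothetical sketch of a custom quantization recipe (flag syntax and
# tensor-name regexes are assumptions, not the commands used for this repo).
# Attention tensors -> q8_0, ffn_down_exps -> q6_k_r4,
# ffn_up_exps / ffn_gate_exps -> q4_k_r4 (also used as the fallback type).
./build/bin/llama-quantize \
    --custom-q "blk\..*\.attn_.*=q8_0,blk\..*\.ffn_down_exps.*=q6_k_r4,blk\..*\.ffn_(up|gate)_exps.*=q4_k_r4" \
    /models/DeepSeek-bf16.gguf \
    /models/DeepSeek-custom.gguf \
    q4_k_r4
```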