Update README.md
README.md
@@ -74,7 +74,7 @@ TODO:
- Re-evaluate `-rtr` (in case Q8_0 can be repacked as Q8_0_R8 after some of the recent patches)

## Quantization Approach

-When running with the **Flash MLA** optimization enabled, **ik_llama** will unpack the **attention** layers into `Q8_0`, so we match that in our model. We also keep all the other small layers as `Q8_0` and leave any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (an idea we borrow from the `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`.
+When running with the **Flash MLA** optimization enabled, **ik_llama** will unpack the **attention** layers into `Q8_0`, so we match that in our model (similar to ubergarm's ik_llama.cpp quants). We also keep all the other small layers as `Q8_0` and leave any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (an idea we borrow from the `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`.

Quantization Summary:
- Keep all the small `F32` layers untouched
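As a minimal sketch of the recipe described above, the per-tensor choices could be passed to `llama-quantize` roughly like this, assuming an ik_llama.cpp build that supports the `--custom-q` per-tensor override option; the file names, imatrix path, and thread count are placeholders:

```bash
# Sketch only: regex=type pairs send the large MoE tensors to their own quant types,
# while the trailing q8_0 base type keeps attention and the other small tensors at Q8_0.
# 1-D F32 tensors (norms, biases) are not quantized and stay as they are.
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "ffn_down_exps=q6_k_r4,ffn_up_exps=q4_k_r4,ffn_gate_exps=q4_k_r4" \
    model-bf16.gguf model-quantized.gguf q8_0 32
```

Here each pattern is matched as a regex against the tensor names, so `ffn_down_exps` is intended to cover every `blk.*.ffn_down_exps.weight` tensor.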