anikifoss committed · Commit 470cd34 (verified) · Parent: 94b30a7

Update README.md

Files changed (1):
  1. README.md +12 -11
README.md CHANGED
@@ -73,7 +73,16 @@ TODO:
- Experiment with new `-mla 3` (recent **ik_llama** patches enable a new MLA implementation on CUDA)
- Re-evaluate `-rtr` (in case Q8_0 can be repacked as Q8_0_R8 after some of the recent patches)

- ## Quantization Approach
+ ### Inference Performance vs VRAM Considerations
+ You can try the following to squeeze out more context on your system:
+ - Running with `-ctk q8_0` can save some VRAM, but it is a little slower on the target system
+ - Reducing buffers can free up a bit more VRAM at a very minor cost to performance (`-amb 512` and `-b 1024 -ub 1024`)
+ - Switching to an IQ quant will save some memory at the cost of performance (*very roughly* 10% memory savings for about 10% slower inference)
+
+ ## Optimizing for coding
+ Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but they often produce code that fails to run or contains bugs. Based on empirical observations, coding seems to be strongly affected by model quantization, so we use larger quantization where it matters to reduce perplexity while staying within the target system constraints of 24-32GB VRAM and 512GB RAM.
+
+ ### Quantization Approach
When running with the **Flash MLA** optimization enabled, **ik_llama** unpacks the **attention** layers into `Q8_0`, so we match that in our model (similar to ubergarm's ik_llama.cpp quants). We also keep all the other small layers as `Q8_0`, and leave any `F32` layers untouched. The MoE layers make up the bulk of the model. The **ffn_down_exps** layers are especially sensitive to quantization (an idea borrowed from the `unsloth` quants), so we quantize them as `Q6_K_R4`. Finally, all the other large MoE layers (**ffn_up_exps**, **ffn_gate_exps**) are quantized as `Q4_K_R4`.

Quantization Summary:
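As an aside, the per-tensor mapping described in the **Quantization Approach** paragraph above can be written down as a quantization recipe. The sketch below is illustrative only: it assumes ik_llama.cpp's `llama-quantize` with a `--custom-q "regex=type,..."` per-tensor override in the style of ubergarm's published recipes; the regexes, file paths, thread count, and fallback type are placeholders rather than the author's actual command, and a complete recipe would enumerate every small tensor class explicitly.

```bash
# Illustrative sketch only -- assumes ik_llama.cpp's llama-quantize and its
# --custom-q "regex=type,..." per-tensor overrides (ubergarm-style recipe).
# Paths, the fallback type, and the thread count are placeholders; a full
# recipe would also list the remaining small tensors explicitly.
./build/bin/llama-quantize \
    --custom-q "attn_=q8_0,token_embd.weight=q8_0,output.weight=q8_0,ffn_down_exps=q6_k_r4,ffn_up_exps=q4_k_r4,ffn_gate_exps=q4_k_r4" \
    /path/to/model-BF16.gguf \
    /path/to/model-Q4_K_R4.gguf \
    Q4_K_R4 32
```

One-dimensional `F32` tensors (norms and biases) are typically left unquantized by the tool, which matches the "leave any `F32` layers untouched" note above.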
@@ -84,19 +93,11 @@ Quantization Summary:

The **attn_kv_b** layers are included in the original model, but they contain the same information as the **attn_k_b** and **attn_v_b** layers. Some quants, like `unsloth`, remove the **attn_kv_b** layers altogether. We keep these layers for completeness, but push them out of VRAM with `attn_kv_b=CPU` when running the model.

- ## Inference Performance vs VRAM Considerations
- You can try the following to squeeze out more context on your system:
- - Running with `-ctk q8_0` can save some VRAM, but is a little slower on the target system
- - Reducing buffers can free up a bit more VRAM at a very minor cost to performance (`-amb 512` and `-b 1024 -ub 1024`)
- - Switching to an IQ quant will save some memory at the cost of performance (*very roughly* 10% memory savings for about 10% slower inference)
-
- ## No imatrix
+ ### No imatrix
Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to negatively affect tasks like coding. In other words, while an imatrix can improve specific benchmarks that resemble its input sample, it also skews model performance towards tasks similar to that sample at the expense of other tasks.

## Benchmarks
- Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but they often produce code that fails to run or contains bugs. Based on empirical observations, coding seems to be strongly affected by model quantization, so we use larger quantization where it matters to reduce perplexity while staying within the target system constraints of 24-32GB VRAM and 512GB RAM.
-
- **System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB
+ **Benchmark System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB

The following quants were tested:
- **Q2_K_R4** (attention - `Q8_0`, all MoE - `Q2_K_R4`)
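For running the finished quant, the flags discussed in the diff above (`-ctk q8_0`, `-amb 512`, `-b 1024 -ub 1024`, and the `attn_kv_b=CPU` override) might be combined roughly as follows. This is a hedged sketch, not the author's command line: the `llama-server` binary name, paths, context size, thread count, the `-ot` spelling of the tensor override, and the `-mla 2 -fa` choice are all assumptions to adapt to your ik_llama.cpp build and hardware.

```bash
# Illustrative launch sketch -- placeholders and assumptions, not the author's exact command.
# -ctk q8_0            : quantized K cache, saves VRAM at a small speed cost (from the README)
# -amb 512             : smaller attention scratch buffer, saves VRAM (from the README)
# -b 1024 -ub 1024     : smaller batch/micro-batch buffers, saves VRAM (from the README)
# -ot "attn_kv_b=CPU"  : applies the attn_kv_b=CPU override mentioned above (flag spelling assumed)
# -ot "ffn_.*_exps=CPU": keeps the large MoE expert tensors in system RAM (assumed pattern)
# -mla 2 -fa           : MLA + flash attention; the TODO above suggests also trying -mla 3
./build/bin/llama-server \
    -m /path/to/model-Q4_K_R4.gguf \
    -c 32768 \
    -ngl 99 \
    -ot "ffn_.*_exps=CPU" \
    -ot "attn_kv_b=CPU" \
    -mla 2 -fa \
    -ctk q8_0 \
    -amb 512 \
    -b 1024 -ub 1024 \
    -t 32
```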
 