Update README.md

README.md (CHANGED)

@@ -15,17 +15,17 @@ See [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/
 ## Run
 Use the following command lines to run the model (tweak the command to further customize it to your needs).
 
-###
+### 24GB VRAM
 ```
 ./build/bin/llama-server \
     --alias anikifoss/DeepSeek-R1-0528-DQ4_K_R4 \
     --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
     --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
-    --ctx-size
-    -ctk
+    --ctx-size 41000 \
+    -ctk q8_0 \
     -mla 2 -fa \
-    -amb
-    -b
+    -amb 512 \
+    -b 1024 -ub 1024 \
     -fmoe \
     --n-gpu-layers 99 \
     --override-tensor exps=CPU,attn_kv_b=CPU \
@@ -35,17 +35,17 @@ Use the following command lines to run the model (tweak the command to further customize it to your needs).
     --port 8090
 ```
 
-###
+### 32GB VRAM
 ```
 ./build/bin/llama-server \
     --alias anikifoss/DeepSeek-R1-0528-DQ4_K_R4 \
     --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
     --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
-    --ctx-size
-    -ctk
+    --ctx-size 75000 \
+    -ctk f16 \
     -mla 2 -fa \
-    -amb
-    -b
+    -amb 1024 \
+    -b 2048 -ub 2048 \
     -fmoe \
     --n-gpu-layers 99 \
     --override-tensor exps=CPU,attn_kv_b=CPU \
@@ -87,7 +87,7 @@ You can try the following to squeeze out more context on your system:
 Generally, imatrix is not recommended for Q4 and larger quants. The problem with imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to negatively affect tasks like coding. In other words, while imatrix can improve specific benchmarks that are similar to the imatrix input sample, it will also skew model performance towards tasks similar to the imatrix sample at the expense of other tasks.
 
 ## Benchmarks
-Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but often produce code that fails to run or contains bugs. Based on empirical observations, coding seems to be strongly affected by model quantization. So we use larger quantization where it matters to reduce perplexity while remaining within the target system constraints of
+Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but often produce code that fails to run or contains bugs. Based on empirical observations, coding seems to be strongly affected by model quantization. So we use larger quantization where it matters to reduce perplexity while remaining within the target system constraints of 24GB-32GB VRAM, 512GB RAM.
 
 **System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB
 
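Once the server from either command above is running, it can be queried over plain HTTP. Below is a minimal sketch, assuming the OpenAI-compatible endpoints that `llama-server` normally exposes and reusing the `--port 8090` and `--alias` values from the commands above; the host and prompt are placeholders.

```
# Liveness check (llama-server typically serves a /health endpoint)
curl http://localhost:8090/health

# Minimal chat completion request; the model name matches the --alias used above
curl http://localhost:8090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "anikifoss/DeepSeek-R1-0528-DQ4_K_R4",
          "messages": [{"role": "user", "content": "Write a short haiku about quantization."}],
          "temperature": 0.5
        }'
```

Since the request body follows the OpenAI chat-completions schema, existing OpenAI-compatible clients should also work when pointed at `http://localhost:8090/v1`.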