Update README.md
README.md (CHANGED)
@@ -35,8 +35,10 @@ Use the following command line to run the model (tweak the command to further cu
 
 Customization:
 - Replace `/mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4` with the location of the model (where you downloaded it)
+- Adjust `--threads` to the number of physical cores on your system
 - Tweak these to your preference `--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0`
 - Add `--no-mmap` to force the model to be fully loaded into memory (this is especially important when running inference speed benchmarks)
+- You can increase `--parallel`, but doing so will cause your context buffer (set via `--ctx-size`) to be shared between tasks executing in parallel
 
 TODO:
 - Experiment with new `-mla 3` (recent **ik_llama** patches enable new MLA implementation on CUDA)
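For readers who want a concrete starting point, here is a minimal sketch of what an invocation using the flags listed above might look like. It is illustrative only: the `./build/bin/llama-server` binary path, the `--ctx-size` and `--threads` values, and the GGUF shard placeholder are assumptions, not the exact command from this README.

```bash
# Illustrative sketch only (not the exact command from this README):
# - assumes ik_llama was built into ./build
# - point -m at the GGUF you downloaded (placeholder filename below)
# - set --threads to your physical core count
./build/bin/llama-server \
    -m /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/<first-gguf-shard>.gguf \
    --threads 32 \
    --ctx-size 32768 \
    --parallel 1 \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --no-mmap
```

Note that raising `--parallel` above 1 splits the same `--ctx-size` across the slots (e.g. `--ctx-size 32768` with `--parallel 2` leaves 16384 tokens per slot), as described in the list above.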
@@ -65,7 +67,7 @@ Generally, imatrix is not recommended for Q4 and larger quants. The problem with
 ## Benchmarks
 Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but often produce code that fails to run or contains bugs. Based on empirical observations, coding performance seems to be strongly affected by model quantization, so we use larger quant types where it matters, reducing perplexity while remaining within the target system constraints of 32GB VRAM and 512GB RAM.
 
-**System:** Threadripper Pro 7975WX, DDR5@5600MHz, RTX 5090 32GB
+**System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB
 
 The following quants were tested:
 - **Q2_K_R4** (attention - `Q8_0`, all MoE - `Q2_K_R4`)
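As a rough sanity check on the memory constraint above, here is a back-of-the-envelope estimate. The numbers are assumptions, not from this README: ~671B total parameters for DeepSeek-R1, ~4.5 bits per weight for Q4_K-class tensors, ~8.5 for Q8_0; the actual mixes here keep some tensors at higher precision, so real file sizes will differ.

```bash
# Back-of-the-envelope weight footprint in GB (assumed parameter count and bits/weight)
echo "671 * 4.5 / 8" | bc -l   # ~377 GB for a Q4_K-class quant -> fits in 512 GB RAM
echo "671 * 8.5 / 8" | bc -l   # ~713 GB for Q8_0 -> exceeds 512 GB RAM
```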