anikifoss committed
Commit af3a922 · verified · 1 Parent(s): c748673

Update README.md
Files changed (1): README.md +3 -1
README.md CHANGED
@@ -35,8 +35,10 @@ Use the following command line to run the model (tweak the command to further cu
 
 Customization:
 - Replace `/mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4` with the location of the model (where you downloaded it)
+- Adjust `--threads` to the number of physical cores on your system
 - Tweak these to your preference: `--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0`
 - Add `--no-mmap` to force the model to be fully loaded into memory (this is especially important when running inference speed benchmarks)
+- You can increase `--parallel`, but doing so will cause your context buffer (set via `--ctx-size`) to be shared between tasks executing in parallel
 
 TODO:
 - Experiment with the new `-mla 3` (recent **ik_llama** patches enable a new MLA implementation on CUDA)
@@ -65,7 +67,7 @@ Generally, imatrix is not recommended for Q4 and larger quants. The problem with
 ## Benchmarks
 Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but often produce code that fails to run or contains bugs. Based on empirical observations, coding seems to be strongly affected by model quantization, so we use larger quantization where it matters, to reduce perplexity while remaining within the target system constraints of 32GB VRAM and 512GB RAM.
 
-**System:** Threadripper Pro 7975WX, DDR5@5600MHz, RTX 5090 32GB
+**System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB
 
 The following quants were tested:
 - **Q2_K_R4** (attention - `Q8_0`, all MoE - `Q2_K_R4`)
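
For reference, the flags touched by this commit compose into an invocation along the following lines. This is a minimal sketch, not the exact command from the README: the `llama-server` binary name, the port, the context size, and the GGUF shard placeholder are assumptions, while the model path, sampling values, and the `--threads`/`--no-mmap`/`--parallel` guidance come from the diff above.

```bash
# Minimal sketch only, assuming the llama-server binary from ik_llama; this is
# NOT the exact command from the README. The port, context size, and the
# YOUR-FIRST-GGUF-SHARD placeholder are assumptions; the model path and the
# sampling values are the ones shown in the diff above.
#
#   --no-mmap  : fully load the model into memory (matters for speed benchmarks)
#   --threads  : physical core count (the 7975WX in the benchmark system has 32)
#   --ctx-size : total context budget, shared across slots when --parallel > 1
#   --parallel : number of concurrent slots; each slot gets ctx-size / parallel
./llama-server \
    --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/YOUR-FIRST-GGUF-SHARD.gguf \
    --no-mmap \
    --threads 32 \
    --ctx-size 32768 \
    --parallel 1 \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --host 127.0.0.1 --port 8080
```

With `--parallel 1` the entire `--ctx-size` budget goes to a single request; raising `--parallel` splits the same budget evenly across concurrent slots, which is the trade-off the new bullet in the diff describes.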