Update README.md
README.md (CHANGED)
@@ -35,8 +35,10 @@ Use the following command line to run the model (tweak the command to further cu
 
 Customization:
 - Replace `/mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4` with the location of the model (where you downloaded it)
+- Adjust `--threads` to the number of physical cores on your system
 - Tweak these to your preference `--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0`
 - Add `--no-mmap` to force the model to be fully loaded into memory (this is especially important when running inference speed benchmarks)
+- You can increase `--parallel`, but doing so will cause your context buffer (set via `--ctx-size`) to be shared between tasks executing in parallel
 
 TODO:
 - Experiment with new `-mla 3` (recent **ik_llama** patches enable new MLA implementation on CUDA)
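For readers who want a concrete starting point, here is a minimal sketch of what an invocation using the flags listed above might look like. It is illustrative only: the `./build/bin/llama-server` binary path, the `--ctx-size` and `--threads` values, and the GGUF shard placeholder are assumptions, not the exact command from this README.

```bash
# Illustrative sketch only (not the exact command from this README):
# - assumes ik_llama was built into ./build
# - point -m at the GGUF you downloaded (placeholder filename below)
# - set --threads to your physical core count
./build/bin/llama-server \
    -m /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/<first-gguf-shard>.gguf \
    --threads 32 \
    --ctx-size 32768 \
    --parallel 1 \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --no-mmap
```

Note that raising `--parallel` above 1 splits the same `--ctx-size` across the slots (e.g. `--ctx-size 32768` with `--parallel 2` leaves 16384 tokens per slot), as described in the list above.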
@@ -65,7 +67,7 @@ Generally, imatrix is not recommended for Q4 and larger quants. The problem with
 ## Benchmarks
 Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens, but often produce code that fails to run or contains bugs. Based on empirical observations, coding performance seems to be strongly affected by model quantization, so we use larger quant types where it matters, reducing perplexity while remaining within the target system constraints of 32GB VRAM and 512GB RAM.
 
-**System:** Threadripper Pro 7975WX, DDR5@5600MHz, RTX 5090 32GB
+**System:** Threadripper Pro 7975WX, 768GB DDR5@5600MHz, RTX 5090 32GB
 
 The following quants were tested:
 - **Q2_K_R4** (attention - `Q8_0`, all MoE - `Q2_K_R4`)
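As a rough sanity check on the memory constraint above, here is a back-of-the-envelope estimate. The numbers are assumptions, not from this README: ~671B total parameters for DeepSeek-R1, ~4.5 bits per weight for Q4_K-class tensors, ~8.5 for Q8_0; the actual mixes here keep some tensors at higher precision, so real file sizes will differ.

```bash
# Back-of-the-envelope weight footprint in GB (assumed parameter count and bits/weight)
echo "671 * 4.5 / 8" | bc -l   # ~377 GB for a Q4_K-class quant -> fits in 512 GB RAM
echo "671 * 8.5 / 8" | bc -l   # ~713 GB for Q8_0 -> exceeds 512 GB RAM
```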