Update README.md

README.md (changed)

@@ -79,7 +79,7 @@ You can try the following to squeeze out more context on your system:
 - Reducing buffers can free up a bit more VRAM at a very minor cost to performance (`-amb 512` and `-b 1024 -ub 1024`)
 - Switching to an IQ quant will save some memory at the cost of performance (*very very roughly* 10% memory savings at the cost of 10% in inference performance)

-## Optimizing
+## Optimizing for Coding

 Smaller quants, like `UD-Q2_K_XL`, are much faster when generating tokens but often produce code that fails to run or contains bugs. Based on empirical observations, coding quality seems to be strongly affected by model quantization, so we use larger quantization where it matters to reduce perplexity while remaining within the target system constraints of 24-32 GB VRAM and 512 GB RAM.

 ### Quantization Approach
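For context, the buffer-related flags mentioned in the diff above are passed on the server command line. The following is a minimal sketch, assuming an ik_llama.cpp-style `llama-server` build; the model path, context size, and GPU layer count are placeholders, not values taken from this repo:

```sh
# Hypothetical example: shrink compute buffers to free a little VRAM.
# -amb caps the attention compute buffer, -b/-ub reduce the batch and
# micro-batch sizes. Model path, -c, and -ngl below are placeholders.
./llama-server \
  -m /models/model-UD-Q2_K_XL.gguf \
  -c 32768 \
  -ngl 99 \
  -amb 512 \
  -b 1024 -ub 1024
```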