Where do you all launch the model? Please help.
#10 by MissByron - opened
I've been trying to get this working for a month now, and I think I'm about to go insane. I can't get it above 0.5 t/s. Please help, I'm desperate at this point.
My setup is:
- RTX 5090 (VRAM 32 GB)
- Intel Core i9-9900K
- RAM: 67 GB
- Disk: 2 TB (ADATA XPG MARS 980 BLADE [SMAR-980B-2TCS])
- Ubuntu 24.04
Tried IQ2_XXS on kobold.cpp and llama.cpp, with the same settings (rough launch command after this list):
- GPU layers: 7 (any higher and it throws an out-of-memory error)
- Threads: 7 or 14 (same outcome either way)
- mmap or mlock (same outcome either way)
- KV cache quantized to 4-bit (it's even slower without it)
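For reference, here's a sketch of what my llama.cpp invocation looks like. The model filename is a placeholder for the actual IQ2_XXS GGUF, and the flag names are from recent llama.cpp builds, so they may differ on older versions:

```bash
# Sketch of my launch command (placeholder model path):
#   -ngl 7       -> anything higher gives an out-of-memory error
#   -t 7         -> also tried 14, same outcome
#   -fa          -> flash attention, which I believe the quantized V-cache needs
#   -ctk/-ctv    -> 4-bit quantized KV cache
#   --mlock      -> also tried plain mmap, same outcome
./llama-cli -m ./model-IQ2_XXS.gguf \
  -ngl 7 -t 7 -fa -ctk q4_0 -ctv q4_0 \
  --mlock -c 4096 -p "Hello"
```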
And the GPU's memory is 98% loaded, mind you, so it's not a case of llama.cpp not seeing it. CUDA Toolkit is installed, everything's compatible, and simple single-file models are flying. Trust me, I've checked ANYTHING and EVERYTHING before coming here.
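In case it helps with diagnosis, this is roughly how I've been monitoring things while it generates (each command in its own terminal; iostat comes from the sysstat package):

```bash
# What I watch during generation:
watch -n 1 nvidia-smi   # VRAM usage -- sits at ~98%
free -h                 # system RAM and swap
iostat -x 1             # NVMe load while tokens are being produced
```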
My main question is: where do you all launch it? I'm definitely missing something, but I've just run out of options.