Where do you all launch the model? Please help.

#10
by MissByron - opened

I've been trying to make it work for a month already, and I think I'm about to go insane. I can't get it above 0.5 tokens/sec. Please help, I'm desperate at this point.

My setup is:

  • RTX 5090 (VRAM 32 GB)
  • Intel Core i9-9900K
  • RAM: 67 GB
  • Disk: 2 TB (ADATA XPG Mars 980 Blade [SMAR-980B-2TCS])
  • Ubuntu 24.04

Tried IQ2_XXS on kobold.cpp and llama.cpp, with the same settings:

  • GPU layers: 7 (any more and it gives an out-of-memory error)
  • Threads: 7 or 14 (same outcome either way)
  • mmap or mlock (same outcome either way)
  • KV cache quantized to 4-bit (without it, it's even slower)
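For reference, this is roughly the command I'm launching with (the model filename and context size here are placeholders, not my exact values):

```shell
# Sketch of the llama.cpp invocation matching the settings above.
# Model path is a placeholder; recent builds may also require -fa
# (flash attention) for the quantized V-cache to take effect.
./llama-cli \
  -m ./model-IQ2_XXS-00001-of-00005.gguf \
  -ngl 7 \                 # GPU layers: 7 (more -> out of memory)
  -t 7 \                   # threads (tried 14 too, same result)
  --mlock \                # also tried with --no-mmap, same result
  -c 4096 \                # context size (placeholder)
  --cache-type-k q4_0 \    # 4-bit quantized KV cache
  --cache-type-v q4_0
```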

And the GPU's memory is 98% utilized, mind you, so it's not that llama.cpp isn't seeing it. CUDA Toolkit is installed, everything's compatible, and smaller single-file models are flying. Trust me, I've checked EVERYTHING before coming here.

My main question is: where do you all launch it? I'm definitely missing something, but I've just run out of options.
