Where do you all launch the model? Please help.

#10
by MissByron - opened

I've been trying to make it work for a month already, and I think I'm about to go insane. I can't get it above 0.5 tokens/sec. Please help, I'm desperate at this point.

My setup is:

  • RTX 5090 (VRAM 32 GB)
  • Intel Core i9-9900K
  • RAM: 67 GB
  • Disk: 2 TB (ADATA XPG Mars 980 Blade [SMAR-980B-2TCS])
  • Ubuntu 24.04

Tried IQ2_XXS on kobold.cpp and llama.cpp, with the same settings:

  • GPU layers: 7 (any more and it gives an out-of-memory error)
  • Threads: 7 or 14 (same outcome either way)
  • mmap or mlock (same outcome either way)
  • KV cache quantized to 4-bit (without it, it's even slower)
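For reference, this is roughly the command I'm launching with (the model filename and context size here are placeholders, not my exact values):

```shell
# Sketch of the llama.cpp invocation matching the settings above.
# Model path is a placeholder; recent builds may also require -fa
# (flash attention) for the quantized V-cache to take effect.
./llama-cli \
  -m ./model-IQ2_XXS-00001-of-00005.gguf \
  -ngl 7 \                 # GPU layers: 7 (more -> out of memory)
  -t 7 \                   # threads (tried 14 too, same result)
  --mlock \                # also tried with --no-mmap, same result
  -c 4096 \                # context size (placeholder)
  --cache-type-k q4_0 \    # 4-bit quantized KV cache
  --cache-type-v q4_0
```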

And the GPU's memory is 98% utilized, mind you, so it's not that llama.cpp isn't seeing it. CUDA Toolkit is installed, everything's compatible, and smaller single-file models are flying. Trust me, I've checked EVERYTHING before coming here.

My main question is: where do you all launch it? I'm definitely missing something, but I've just run out of options.
