Let's talk about resource utilization of the model
#10 · opened by kalashshah19
I want to know the GPU VRAM usage of Q4_K_XL, Q4_K_M, and Q4_K_S. Can anyone who has used the model share those numbers?
A late answer, but if you run it with `-ot ffn=CPU` (not the same as `-cmoe`), the attention layers take only about 2 GB of VRAM; the rest is the compute buffer and the KV cache. I was able to fit an insane 950k context on my 3090. I have since switched to Q6_K_XL, which is a bit bigger than Q4_K_XL, and reduced the context size to 900k. A minimal invocation sketch is below.
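For anyone who wants to try this, here is a minimal sketch of a `llama-server` invocation along those lines. The model filename and context value are placeholders, not the exact setup from this thread; adjust them to your own model and VRAM:

```sh
# Sketch, assuming a recent llama.cpp build; model path is a placeholder.
# -ngl 99       : offload all layers to the GPU
# -ot "ffn=CPU" : override tensors matching "ffn" back to the CPU,
#                 so only attention weights stay in VRAM (as described above)
# -c 950000     : context size; tune to whatever fits on your card
./llama-server -m ./model-Q4_K_XL.gguf -ngl 99 -ot "ffn=CPU" -c 950000
```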
Great, thanks!
I think Q4 even allowed a 1M context, although I used `q4_1` quants for the KV cache. I don't remember the exact parameters I used, but it should be possible.
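For reference, quantizing the KV cache in llama.cpp is done with `--cache-type-k` / `--cache-type-v` (aliases `-ctk` / `-ctv`). A hedged sketch building on the command above; again the model path and context size are placeholders, and a quantized V cache requires flash attention (flag syntax varies by build: `-fa` on older builds, `-fa on` / `--flash-attn on` on newer ones):

```sh
# Sketch: q4_1-quantized K and V caches to stretch context toward 1M.
# Halving cache precision roughly halves KV-cache VRAM vs. f16.
./llama-server -m ./model-Q4_K_XL.gguf -ngl 99 -ot "ffn=CPU" \
  -fa -ctk q4_1 -ctv q4_1 -c 1000000
```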