Let's talk about resource utilization of the model.

#10 opened by kalashshah19

I want to know the GPU VRAM usage of Q4_K_XL, Q4_K_M, and Q4_K_S. Can anyone who has used the model tell me?

It's a late answer, but if you run it with `-ot ffn=CPU` (not the same as `-cmoe`), the attention layers will only take about 2 GB; the rest of the VRAM goes to the compute buffer and the KV cache. I was able to fit an insane 950k context on my 3090. I later switched to Q6_K_XL, which is a bit bigger than Q4_K_XL, and reduced the context size to 900k.
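
In case it helps anyone else, here is a rough sketch of the kind of llama.cpp invocation described above. The model filename is a placeholder, the context value is just the figure quoted in this thread, and flag spellings can vary a bit between llama.cpp versions:

```bash
# Sketch of the CPU-offload trick described above. "model-Q4_K_XL.gguf" is a
# placeholder filename. Offload all layers to the GPU, then override every
# FFN tensor back to the CPU, so only the attention weights (~2 GB per the
# post above) plus the KV cache and compute buffers live in VRAM.
./llama-server \
  -m model-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "ffn=CPU" \
  -c 950000
```

The `-ngl 99` first offloads everything, and the `-ot` regex then pins any tensor whose name matches `ffn` back to system RAM.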

Great, thanks!

I think Q4 allowed even 1M context, although I used q4_1 quants for the KV cache. I don't remember the exact parameters I used, but it should be possible.
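
For reference, squeezing the KV cache like that would look roughly like this (again a sketch, not the exact command from this thread; the filename is a placeholder):

```bash
# Same offload trick, plus q4_1-quantized K and V caches to shrink the KV
# cache well below the default f16 size. A quantized V cache requires flash
# attention (bare -fa in older llama.cpp builds, -fa on in newer ones).
# "model-Q4_K_XL.gguf" is still a placeholder filename.
./llama-server \
  -m model-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "ffn=CPU" \
  -fa on \
  -ctk q4_1 -ctv q4_1 \
  -c 1000000
```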
