Let's talk about resource utilization of the model.

#10 opened by kalashshah19

I want to know the GPU VRAM usage of Q4_K_XL, Q4_K_M, and Q4_K_S. Can anyone who has used the model tell me?

It's a late answer, but if you run it with `-ot ffn=CPU` (not the same as `-cmoe`), the attention layers will only take about 2 GB; the rest of the VRAM goes to the compute buffer and the KV cache. I was able to fit an insane 950k context on my 3090. I later switched to Q6_K_XL, which is a bit bigger than Q4_K_XL, and reduced the context size to 900k.
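
In case it helps anyone else, here is a rough sketch of the kind of llama.cpp invocation described above. The model filename is a placeholder, the context value is just the figure quoted in this thread, and flag spellings can vary a bit between llama.cpp versions:

```bash
# Sketch of the CPU-offload trick described above. "model-Q4_K_XL.gguf" is a
# placeholder filename. Offload all layers to the GPU, then override every
# FFN tensor back to the CPU, so only the attention weights (~2 GB per the
# post above) plus the KV cache and compute buffers live in VRAM.
./llama-server \
  -m model-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "ffn=CPU" \
  -c 950000
```

The `-ngl 99` first offloads everything, and the `-ot` regex then pins any tensor whose name matches `ffn` back to system RAM.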

Great, thanks!

I think Q4 allowed even 1M context, although I used q4_1 quants for the KV cache. I don't remember the exact parameters I used, but it should be possible.
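
For reference, squeezing the KV cache like that would look roughly like this (again a sketch, not the exact command from this thread; the filename is a placeholder):

```bash
# Same offload trick, plus q4_1-quantized K and V caches to shrink the KV
# cache well below the default f16 size. A quantized V cache requires flash
# attention (bare -fa in older llama.cpp builds, -fa on in newer ones).
# "model-Q4_K_XL.gguf" is still a placeholder filename.
./llama-server \
  -m model-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "ffn=CPU" \
  -fa on \
  -ctk q4_1 -ctv q4_1 \
  -c 1000000
```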
