Trying to run this on a 4090 with 192 GB RAM... not enough RAM???

#10
by MikaSouthworth - opened

In vLLM it eats RAM until it reaches about 185 GB... how can I stop it from doing that? Does it really need that much? It climbs to that point, then either says "not enough for cache" and crashes, or just crashes with an OOM error.
I have the newest FlashInfer (using it as the attention backend after the Triton backend crashed even sooner), the newest xformers (0.0.33.dev1091), and the newest vLLM 11.1 (cloned the repo and installed it from source)... I don't know what else to do.
I've got an AMD Ryzen with 16 cores.
Transformers worked, but when offloading it only used one of my CPU cores, no matter what I tried. I'd try quantization, but the methods that don't just blindly quantize everything (and risk breaking the model) need a config file.
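For reference, these are the vLLM arguments that usually bound how much memory it tries to reserve; this is a minimal sketch assuming a recent vLLM build, and the model ID is a placeholder rather than a specific repo:

```python
# Minimal sketch of memory-bounding vLLM arguments (recent vLLM assumed).
# "<model-id>" is a placeholder for whatever checkpoint you are loading.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<model-id>",
    trust_remote_code=True,
    max_model_len=8192,            # smaller context window -> smaller KV-cache reservation
    gpu_memory_utilization=0.90,   # fraction of the GPU's VRAM vLLM is allowed to claim
    swap_space=4,                  # GiB of CPU RAM reserved as KV-cache swap space
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```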

Use cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit with vLLM if you have 48 GB of VRAM.
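A hedged sketch of loading that AWQ checkpoint with vLLM's Python API; the quantization argument is normally detected from the checkpoint config, so passing it explicitly is only for clarity:

```python
# Sketch: load the 4-bit AWQ checkpoint suggested above (assumes it fits in ~48 GB VRAM).
from vllm import LLM

llm = LLM(
    model="cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit",
    quantization="awq",        # usually auto-detected from the repo's quantization config
    trust_remote_code=True,
    max_model_len=8192,
)
```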

I run it on 2x 3090s or 2x 4090s. It will not run with vLLM on only one card.
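For the two-card setup, the relevant argument is tensor_parallel_size; a sketch under the assumption that you are loading the AWQ repo from the previous reply:

```python
# Sketch: shard the model across two GPUs (2x 3090 or 2x 4090) with tensor parallelism.
from vllm import LLM

llm = LLM(
    model="cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit",  # assumption: the AWQ repo above
    tensor_parallel_size=2,    # split weights and KV cache across both cards
    trust_remote_code=True,
)
```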

For CPU offloading of the experts you will have to wait for a GGUF, once llama.cpp has support for this model working.

Use it with enforce_eager set to true.
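In vLLM terms that means passing enforce_eager=True (or --enforce-eager on the CLI), which disables CUDA graph capture and frees the extra GPU memory it would otherwise reserve; a short sketch with a placeholder model ID:

```python
# Sketch: disable CUDA graph capture to cut vLLM's extra GPU memory reservation.
from vllm import LLM

llm = LLM(
    model="<model-id>",        # placeholder
    enforce_eager=True,        # eager mode: no CUDA graphs, lower memory, some speed cost
    trust_remote_code=True,
)
```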
