Does it work in Ollama? How do you assemble the parts?

#1
by Colonino - opened

Newbie question. I have a Geforce 5090 so this should work for me. Will the NVFP4 format work in Ollama? Even newbie-er: I have the fifteen parts downloaded. How do I assemble them? Thanks!

I don't think it does; this is for vLLM (which is what I tested), SGLang, or TensorRT.

You might be able to follow these instructions for Ollama: https://ollama.readthedocs.io/en/import/#importing-a-fine-tuned-adapter-from-safetensors-weights but that seems like overkill.

As a total newbie, the easiest way to run models (just download and launch a single command) would be KoboldCpp https://github.com/LostRuins/koboldcpp, loading a GGUF from either:

(no need to combine the parts; KoboldCpp will load them all if they are in the same folder — see the sketch below for a quick way to check that every part made it down)
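A minimal sketch for checking that all parts of a split GGUF are present before loading the first one. It assumes the standard llama.cpp `-NNNNN-of-NNNNN.gguf` split naming; it is not part of KoboldCpp itself:

```python
import re
from pathlib import Path

def check_gguf_shards(folder: str) -> None:
    """Report missing parts of a '-NNNNN-of-NNNNN.gguf' split model."""
    pattern = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")
    found, total = set(), 0
    for f in Path(folder).glob("*.gguf"):
        m = pattern.search(f.name)
        if m:
            found.add(int(m.group(1)))
            total = int(m.group(2))
    if total == 0:
        print("no split GGUF files found")
    else:
        missing = sorted(set(range(1, total + 1)) - found)
        print("all shards present" if not missing else f"missing shards: {missing}")

check_gguf_shards(".")  # run in the folder holding the downloaded parts
```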

The reason I use vLLM is that it has much better context processing once you reach long multi-turn conversations.

Now regarding your hardware, a single RTX 5090 does not have enough VRAM for a comfortable experience. You can estimate the maximum tokens/s by dividing the memory bandwidth by the model size:

  • an RTX 5090: 1800 GB/s / ~70 GB -> ~25.8 tok/s
  • a CPU with overclocked dual-channel DDR5 memory: 80 GB/s / ~70 GB -> ~1.15 tok/s

Given that the context MUST be on the GPU, say 10 GB of it, that leaves roughly 20 GB of the weights on the GPU and 50 GB on the CPU, so your speed ceiling is about 20/70 * 25.8 ≈ 7.4 tok/s (see the sketch below).
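As a minimal sketch of that back-of-the-envelope math (the 1800 GB/s, 80 GB/s, ~70 GB, and 20 GB / 50 GB split figures are the same assumptions as above, not measurements):

```python
# Rough decode-speed ceiling: every weight byte must be read once per token,
# so tokens/s is bounded by memory bandwidth / model size.

def speed_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s for memory-bandwidth-bound generation."""
    return bandwidth_gb_s / weights_gb

gpu_only = speed_ceiling(1800, 70)  # RTX 5090 if the whole model fit in VRAM: ~25.7 tok/s
cpu_only = speed_ceiling(80, 70)    # dual-channel DDR5: ~1.1 tok/s

# ~10 GB of VRAM goes to the context, leaving ~20 GB of the 70 GB of weights
# on the GPU; scale the GPU ceiling by that fraction, as done above.
split_ceiling = (20 / 70) * gpu_only  # ~7.3 tok/s

print(f"GPU-only: {gpu_only:.1f}, CPU-only: {cpu_only:.1f}, split: {split_ceiling:.1f} tok/s")
```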

In practice, on an RTX Pro 6000 (same memory bandwidth as the RTX 5090 but 96 GB of VRAM) I get up to 17 tok/s generation speed, so you might get 5 tok/s or so.

If 5 tok/s is enough for you then that's OK; otherwise I recommend a Mixture-of-Experts model, where only a small portion of the model is activated per token (a token is, at minimum, a single letter and, at most, roughly a small word). For example, from TheDrummer you have https://huggingface.co/TheDrummer/GLM-Steam-106B-A12B-v1
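To see why the MoE route helps, here is a minimal sketch using the same bandwidth-over-bytes-read logic. The 12B active-parameter count comes from the model name above; the ~0.56 bytes/parameter (a roughly 4.5-bit quant) is an assumption for illustration:

```python
# With a Mixture-of-Experts model, only the active parameters are streamed per
# token, so the per-token read drops from the full ~70 GB to a few GB.

def moe_ceiling(bandwidth_gb_s: float, active_params_billion: float,
                bytes_per_param: float) -> float:
    """Rough tokens/s ceiling when only the active expert weights are read."""
    active_gb = active_params_billion * bytes_per_param
    return bandwidth_gb_s / active_gb

# 12B active params at ~0.56 bytes/param (assumed ~4.5-bit quant) ≈ 6.7 GB per token.
print(f"{moe_ceiling(1800, 12, 0.56):.0f} tok/s")  # ceiling if those reads come from VRAM
```

In practice some experts will still sit in system RAM on a 32 GB card, so the real number will be lower, but the per-token bandwidth requirement is an order of magnitude smaller than for a dense 70 GB model.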

Thank you very much for the time you took to write a detailed and clear explanation. Yes, I've run on my system the GGUF Q5_K_M version of Behemoth-X and it is pretty slow but I liked the results and had hopes a new-to-me format specifically made for Blackwell would work better. Your answer and some back and forth with ChatGPT clarified that Ollama so far only supports GGUF and MXFP4 (to run the gpt-oss models). Maybe someday they'll add support for NVFP4, but probably it's not high on their list! (Or maybe people will start formatting more models using MXFP4...)

mratsim changed discussion status to closed
