low memory usage
Is there any way to run this with less memory? My GPU only has 24 GB. I get this error message: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU
@Knut-J 24GB isn't gonna cut it for this beast of a model. NVLM-D 72B is huge. But don't give up yet! Try these tricks:
- CPU offloading: Use device_map="auto" when loading the model. It'll be slow as molasses, but it might just work.
- 8-bit quantization: Add load_in_8bit=True to your model loading. It'll sacrifice some quality, but hey, beggars can't be choosers.
Last resort: Downgrade to a smaller model. Sometimes you gotta know when to fold 'em.
Fair warning: These hacks might make your inference slower than a snail on tranquilizers. But if you're dead set on using this model, it's worth a shot. Good luck!
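For a rough sense of why 24 GB falls short, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: the parameter count is an assumption taken from the "72B" in the model name, and it ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and overhead.
PARAMS = 72e9  # assumption: parameter count implied by the "72B" in the name

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "int8": 1, "int4": 0.5}


def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_gib(PARAMS, nbytes):.0f} GiB")
```

Under these assumptions, even 4 bits per weight comes out above 24 GB, so quantization alone probably won't fit the model on one such card without also offloading part of it to CPU.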
Anybody use an external hard drive to run this?
Hi,
I have tried to run the inference code given here on AWS p3dn.24xlarge and p4de.24xlarge instances, but I am running into a disk space error:
OSError: [Errno 28] No space left on device
Specs of the instance:
I have tried following the tips given here: https://discuss.huggingface.co/t/no-space-left-on-device-when-downloading-a-large-model-for-the-sagemaker-training-job/43643
Any help is appreciated. Please let me know if I am missing something.
Thanks in advance
Hi Malini,
Thank you for your interest in our model.
The first screenshot shows that your home directory does not have enough disk space. Running NVLM-D requires roughly 200GB of disk space.
Your second screenshot suggests that you likely have separate disks.
Please try running the following commands on the disk with 1000GB.
- Install Git Large File Storage (LFS) by running:
git lfs install
- Clone the NVLM-D repository using:
git clone https://huggingface.co/nvidia/NVLM-D-72B
(say this clones the model into your local path "path/to/NVLM-D-72B")
- Load the model with your local path
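import torch
from transformers import AutoModel

path = "path/to/NVLM-D-72B"
# split_model() is the device-map helper defined in the model card's inference example
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()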
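Before cloning, it may help to confirm that the target disk really has the free space mentioned above. The following is a minimal sketch using only the Python standard library. The 200 GB threshold is the approximate figure from this thread, and the "." path is a placeholder for whichever mount you plan to clone into.

```python
import shutil

REQUIRED_GB = 200  # approximate figure quoted in this thread


def has_enough_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# "." is a placeholder; point it at the disk you intend to clone into.
print(has_enough_space("."))
```

If this prints False for your home directory but True for the larger disk, run the clone from a directory on the larger disk.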
Let me know if you encounter any issues, and I’d be happy to assist further!
Best,
Boxin
Update!
Thank you @boxin-wbx! The issue was fixed once I installed Git LFS and cloned the repo.
I did need to set up the device_map to run inference on text.
When I run inference on images, I get a memory error. Sharing the screenshot below.
We haven't tested on V100 before. But a node with 2 H100 / A100 GPUs (each with 80GB of memory) should work.
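To illustrate what spreading a model over several GPUs looks like, here is a hypothetical sketch that builds a device map assigning consecutive decoder layers to GPUs in near-equal chunks. This is not the split_model() helper from the model card. The function name, the module prefix, and the layer count of 80 are all assumptions for illustration, and a real device map would also need to place the embeddings, the vision encoder, and the output head.

```python
def even_device_map(num_layers: int, num_gpus: int,
                    prefix: str = "language_model.model.layers") -> dict:
    """Assign consecutive layers to GPUs in near-equal chunks.

    `prefix` is a hypothetical module path, not taken from the model itself.
    """
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    return {f"{prefix}.{i}": i // per_gpu for i in range(num_layers)}


# Assumed values: 80 decoder layers split across 2 GPUs.
device_map = even_device_map(num_layers=80, num_gpus=2)
print(device_map["language_model.model.layers.0"],
      device_map["language_model.model.layers.79"])
```

A dictionary like this can be passed as device_map to from_pretrained, but for actual use the split_model() helper from the model card is the safer choice since it matches the real module names.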


