Instructions to use nvidia/NVLM-D-72B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/NVLM-D-72B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nvidia/NVLM-D-72B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/NVLM-D-72B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use nvidia/NVLM-D-72B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/NVLM-D-72B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVLM-D-72B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
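
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the previous step is listening on localhost:8000. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port to your server.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, image_url: str, model: str = "nvidia/NVLM-D-72B") -> dict:
    """Build the same JSON body as the curl example: one user turn with text and an image URL."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload: dict, url: str = VLLM_URL, timeout: float = 300.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires a running server, so it is left commented out):
# payload = build_chat_request(
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```

The same helpers should work against the SGLang server by passing `url="http://localhost:30000/v1/chat/completions"`, since both servers are described here as exposing an OpenAI-compatible API.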

Use Docker

docker model run hf.co/nvidia/NVLM-D-72B

SGLang

How to use nvidia/NVLM-D-72B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/NVLM-D-72B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVLM-D-72B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVLM-D-72B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVLM-D-72B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use nvidia/NVLM-D-72B with Docker Model Runner:
```
docker model run hf.co/nvidia/NVLM-D-72B
```

low memory usage

#10

by Knut-J - opened Oct 5, 2024

Discussion

Knut-J

Oct 5, 2024

Is there any way to run this with low memory? My GPU only has 24 GB. I get this error message: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU

fatihburakkaragoz

Oct 5, 2024

@Knut-J 24GB isn't gonna cut it for this beast of a model. NVLM-D 72B is huge. But don't give up yet! Try these tricks:

CPU offloading: Use device_map="auto" when loading the model. It'll be slow as molasses, but it might just work.
8-bit quantization: Add load_in_8bit=True to your model loading. It'll sacrifice some quality, but hey, beggars can't be choosers.
Last resort: Downgrade to a smaller model. Sometimes you gotta know when to fold 'em.

Fair warning: These hacks might make your inference slower than a snail on tranquilizers. But if you're dead set on using this model, it's worth a shot. Good luck!
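
As a rough sanity check on why 24 GB falls short, the sketch below estimates the memory needed for the weights alone at several precisions. It assumes a nominal 72 billion parameters (read off the "72B" in the model name, not an exact count) and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

# Assumption: nominal parameter count inferred from the "72B" in the model name.
NUM_PARAMS = 72e9


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_memory(num_params: float, dtype: str, available_gb: float) -> bool:
    """True if the weights alone fit in the given amount of memory."""
    return weight_memory_gb(num_params, dtype) <= available_gb


for dtype in BYTES_PER_PARAM:
    size = weight_memory_gb(NUM_PARAMS, dtype)
    verdict = "fits" if fits_in_memory(NUM_PARAMS, dtype, 24.0) else "does not fit"
    print(f"{dtype:>8}: ~{size:.0f} GB of weights, {verdict} in 24 GB")
```

Under these assumptions even 8-bit weights are about three times larger than a 24 GB card, which is why the suggestions above combine quantization with CPU offloading.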

Apiphine

Oct 20, 2024

Anybody use an external hard drive to run this?

Malini

Oct 23, 2024

edited Oct 23, 2024

Hi,

I tried to run the inference code given here on AWS p3dn.24xlarge and p4de.24xlarge instances, but I am running into a disk space error:
OSError: [Errno 28] No space left on device

specs of the instance

I have tried following the tips given here: https://discuss.huggingface.co/t/no-space-left-on-device-when-downloading-a-large-model-for-the-sagemaker-training-job/43643

Any help is appreciated. Please let me know if I am missing something.
Thanks in advance

boxin-wbx

NVIDIA org Oct 23, 2024

Hi Malini,

Thank you for your interest in our model.

The first screenshot shows that your home directory does not have enough disk space. Running NVLM-D requires at least around 200GB of disk space.

Your second screenshot suggests that you likely have separate disks.
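
To confirm which disk has room before downloading, a quick check along these lines may help. This is a minimal sketch using Python's standard `shutil.disk_usage`; the 200GB threshold comes from the estimate above, and the candidate paths are placeholders to replace with your own mount points.

```python
import os
import shutil

# "At least around 200GB", per the estimate above.
REQUIRED_GB = 200.0


def free_space_gb(path: str) -> float:
    """Free space on the filesystem containing `path`, in GB (1 GB = 1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_model(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem containing `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


# Placeholder candidates: the home directory and the filesystem root.
# Add the mount point of your larger disk here.
for candidate in [os.path.expanduser("~"), os.sep]:
    if os.path.exists(candidate):
        status = "enough" if has_room_for_model(candidate) else "not enough"
        print(f"{candidate}: {free_space_gb(candidate):.0f} GB free ({status} space)")
```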

Please try running the following commands on the disk with 1000GB.

Install Git Large File Storage (LFS) by running:

git lfs install

Clone the NVLM-D repository using:

git clone https://huggingface.co/nvidia/NVLM-D-72B

(say this clones the model into your local path "path/to/NVLM-D-72B")

Load the model with your local path

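import torch
from transformers import AutoModel

path = "path/to/NVLM-D-72B"
# split_model() is not defined in this reply; it is presumably the multi-GPU
# device-map helper from the model card's inference code. Define it before running.
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()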

Let me know if you encounter any issues, and I’d be happy to assist further!

Best,
Boxin

Malini

Oct 23, 2024

edited Oct 23, 2024

Update!
Thank you @boxin-wbx! The issue was fixed when I installed Git LFS and cloned the repo.
I did need to run the device_map() to run the inference on text.
When I run inference on images, I get a memory error. Sharing the screenshot below.

boxin-wbx

NVIDIA org Oct 23, 2024

We haven't tested on V100 before. But a node with 2 H100 / A100 GPUs (each with 80GB of memory) should work.

Malini

Oct 24, 2024

Thanks @boxin-wbx. It worked on an ml.p4de.24xlarge instance. I appreciate your input.
