[Twin Tower] vanilla-460M-coco-baseline
Collection: vanilla-460M-lrV.0-lrL.5e-06-lrMP.1e-05-bs.32 • 11 items
nanoVLM is a minimal, lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. The base model combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model; this checkpoint is a 460M-parameter variant of that architecture trained on COCO captions.
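As a conceptual illustration only (not the repository's actual code), the two-tower design boils down to projecting the vision encoder's patch embeddings into the language model's embedding space and decoding them together with the text tokens. All class and attribute names below are hypothetical:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Hypothetical sketch of the nanoVLM two-tower design: a ViT image
    encoder, a modality projector, and a causal language model."""

    def __init__(self, vision_encoder, language_model, vit_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a SigLIP-B/16 ViT
        self.modality_projector = nn.Linear(vit_dim, lm_dim)  # image tokens -> LM embedding space
        self.language_model = language_model    # e.g. a SmolLM2 decoder

    def forward(self, image, text_embeds):
        # Encode the image into patch embeddings and project them into LM space
        img_embeds = self.modality_projector(self.vision_encoder(image))
        # Prepend the image tokens to the text embeddings and decode causally
        return self.language_model(torch.cat([img_embeds, text_embeds], dim=1))
```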
For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.
Usage:
Clone the nanoVLM repository (https://github.com/huggingface/nanoVLM), follow its install instructions, and run the following code:
```python
from models.vision_language_model import VisionLanguageModel

# Load this checkpoint from the Hugging Face Hub
model = VisionLanguageModel.from_pretrained("patrickamadeus/nanoVLM-460M-vanilla-cococap-6000")
```
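For inference, the sketch below is modeled on the repository's generate.py. It assumes the get_tokenizer and get_image_processor helpers from data.processors and the model.generate signature as they appear in the nanoVLM repository; the prompt and image path are placeholders, and exact names may differ across versions.

```python
import torch
from PIL import Image

from models.vision_language_model import VisionLanguageModel
from data.processors import get_tokenizer, get_image_processor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = VisionLanguageModel.from_pretrained(
    "patrickamadeus/nanoVLM-460M-vanilla-cococap-6000"
).to(device)
model.eval()

# The tokenizer and image processor are derived from the checkpoint's config
tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

# Build the prompt and preprocess an example image (the path is a placeholder)
prompt = "Question: Describe this image. Answer:"
tokens = tokenizer.batch_encode_plus([prompt], return_tensors="pt")["input_ids"].to(device)
image = Image.open("example.jpg").convert("RGB")
image = image_processor(image).unsqueeze(0).to(device)

# Generate a caption autoregressively
with torch.no_grad():
    generated = model.generate(tokens, image, max_new_tokens=40)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```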