---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
  - vision-language
  - multimodal
  - research
  - twin-tower
---
# Twin-Tower VLM

Twin-Tower VLM is a vision-language model based on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.
## Architecture

The twin-tower architecture consists of:

- Vision Tower: processes images through a vision encoder → modality projector → decoder layers to create per-layer contexts (see the sketch after this list)
- Language Tower: a frozen language model that receives the vision contexts and generates text
 
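As an illustration of the vision encoder → modality projector → decoder-layer flow described above, here is a minimal PyTorch sketch. The class name, dimensions, and layer count are assumptions chosen for illustration only and do not reflect the actual nanoVLM / `twin_tower` implementation.

```python
import torch
import torch.nn as nn


class VisionTowerSketch(nn.Module):
    """Toy stand-in for the vision tower: encoder -> modality projector ->
    decoder layers, emitting one context tensor per language-model layer."""

    def __init__(self, vision_dim=768, lm_dim=576, num_lm_layers=30):
        super().__init__()
        self.encoder = nn.Linear(vision_dim, vision_dim)   # placeholder for the SigLIP encoder
        self.projector = nn.Linear(vision_dim, lm_dim)     # modality projector into LM space
        self.decoder_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
            for _ in range(num_lm_layers)
        )

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim)
        h = self.projector(self.encoder(image_feats))
        contexts = []
        for layer in self.decoder_layers:
            h = layer(h)
            contexts.append(h)   # one context per language-model layer
        return contexts


# Example: a 64-patch image embedding -> 30 contexts, each of shape (1, 64, 576)
# contexts = VisionTowerSketch()(torch.randn(1, 64, 768))
```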
## Key Features

- Twin-Tower Design: separate processing of vision and language, with per-layer context integration
- Frozen Language Tower: the language model's parameters are frozen; gradients flow back through the vision contexts (see the sketch after this list)
- Per-Layer Contexts: the vision tower generates a context for each language-model layer
- Efficient Training: only the vision-tower components are trainable
 
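As a rough sketch of this training setup (not the actual `twin_tower` API; `language_tower` and `vision_tower` are assumed attribute names), one way to freeze the language model while keeping the vision tower trainable is:

```python
import torch


def trainable_vision_params(model):
    """Freeze the language tower and return only the vision-tower parameters.

    The attribute names `language_tower` and `vision_tower` are hypothetical
    placeholders, not the real twin_tower API.
    """
    for p in model.language_tower.parameters():
        p.requires_grad_(False)   # language tower stays frozen
    # Gradients still flow back *through* the frozen LM into the vision contexts,
    # but only the vision-tower weights receive updates.
    return [p for p in model.vision_tower.parameters() if p.requires_grad]


# optimizer = torch.optim.AdamW(trainable_vision_params(model), lr=1e-4)
```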
## Usage

```python
from PIL import Image

from config import VLMConfig
from twin_tower import VisionLanguageTwinTowerModel

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
## Model Details

- Base Model: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-3000
- Architecture: Twin-Tower VLM
- Vision Encoder: SigLIP-based
- Language Model: SmolLM2-based
- Parameters: ~230M total (vision tower trainable, language tower frozen; see the check sketched below)
 
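A quick, generic way to sanity-check the total vs. trainable split quoted above on a loaded model (the ~230M figure comes from this card, not from running this snippet):

```python
def count_params(model):
    """Print total vs. trainable parameter counts for any torch.nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"total: {total / 1e6:.1f}M  trainable: {trainable / 1e6:.1f}M")


# count_params(model)  # expect ~230M total, with only the vision tower trainable
```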
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.