---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
  - vision-language
  - multimodal
  - research
  - twin-tower
---
# Twin-Tower VLM

Twin-Tower VLM is a vision-language model based on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.
## Architecture

The twin-tower architecture consists of:

- Vision Tower: processes images through a vision encoder → modality projector → decoder layers to create per-layer contexts (see the sketch after this list)
- Language Tower: a frozen language model that receives the vision contexts and generates text
 
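As an illustration of the vision encoder → modality projector → decoder-layer flow described above, here is a minimal PyTorch sketch. The class name, dimensions, and layer count are assumptions chosen for illustration only and do not reflect the actual nanoVLM / `twin_tower` implementation.

```python
import torch
import torch.nn as nn


class VisionTowerSketch(nn.Module):
    """Toy stand-in for the vision tower: encoder -> modality projector ->
    decoder layers, emitting one context tensor per language-model layer."""

    def __init__(self, vision_dim=768, lm_dim=576, num_lm_layers=30):
        super().__init__()
        self.encoder = nn.Linear(vision_dim, vision_dim)   # placeholder for the SigLIP encoder
        self.projector = nn.Linear(vision_dim, lm_dim)     # modality projector into LM space
        self.decoder_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
            for _ in range(num_lm_layers)
        )

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim)
        h = self.projector(self.encoder(image_feats))
        contexts = []
        for layer in self.decoder_layers:
            h = layer(h)
            contexts.append(h)   # one context per language-model layer
        return contexts


# Example: a 64-patch image embedding -> 30 contexts, each of shape (1, 64, 576)
# contexts = VisionTowerSketch()(torch.randn(1, 64, 768))
```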
## Key Features

- Twin-Tower Design: separate processing of vision and language, with per-layer context integration
- Frozen Language Tower: the language model's parameters are frozen; gradients flow back through the vision contexts (see the sketch after this list)
- Per-Layer Contexts: the vision tower generates a context for each language-model layer
- Efficient Training: only the vision-tower components are trainable
 
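As a rough sketch of this training setup (not the actual `twin_tower` API; `language_tower` and `vision_tower` are assumed attribute names), one way to freeze the language model while keeping the vision tower trainable is:

```python
import torch


def trainable_vision_params(model):
    """Freeze the language tower and return only the vision-tower parameters.

    The attribute names `language_tower` and `vision_tower` are hypothetical
    placeholders, not the real twin_tower API.
    """
    for p in model.language_tower.parameters():
        p.requires_grad_(False)   # language tower stays frozen
    # Gradients still flow back *through* the frozen LM into the vision contexts,
    # but only the vision-tower weights receive updates.
    return [p for p in model.vision_tower.parameters() if p.requires_grad]


# optimizer = torch.optim.AdamW(trainable_vision_params(model), lr=1e-4)
```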
## Usage

```python
from PIL import Image

from config import VLMConfig
from twin_tower import VisionLanguageTwinTowerModel

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
## Model Details

- Base Model: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-3000
- Architecture: Twin-Tower VLM
- Vision Encoder: SigLIP-based
- Language Model: SmolLM2-based
- Parameters: ~230M total (vision tower trainable, language tower frozen; see the check sketched below)
 
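A quick, generic way to sanity-check the total vs. trainable split quoted above on a loaded model (the ~230M figure comes from this card, not from running this snippet):

```python
def count_params(model):
    """Print total vs. trainable parameter counts for any torch.nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"total: {total / 1e6:.1f}M  trainable: {trainable / 1e6:.1f}M")


# count_params(model)  # expect ~230M total, with only the vision tower trainable
```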
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.