---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
- twin-tower
---

**Twin-Tower VLM** is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.

## Architecture

The twin-tower architecture consists of:

1. **Vision Tower**: Processes images through a vision encoder → modality projector → decoder layers to produce per-layer contexts
2. **Language Tower**: A frozen language model that receives the vision contexts and generates text

## Key Features

- **Twin-Tower Design**: Vision and language are processed separately and integrated through per-layer contexts
- **Frozen Language Tower**: The language model parameters are frozen; gradients flow only through the vision contexts
- **Per-Layer Contexts**: The vision tower produces one context per language model layer
- **Efficient Training**: Only the vision tower components are trainable

## Usage

```python
# Requires the twin-tower nanoVLM code (e.g. twin_tower.py and config.py) to be importable.
from PIL import Image

from twin_tower import VisionLanguageTwinTowerModel
from config import VLMConfig

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```

## Model Details

- **Base Model**: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-3000
- **Architecture**: Twin-Tower VLM
- **Vision Encoder**: SigLIP-based
- **Language Model**: SmolLM2-based
- **Parameters**: ~230M total (vision tower trainable, language tower frozen)

For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.
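
## Conceptual Sketch of Per-Layer Context Injection

To illustrate the idea of "per-layer contexts with a frozen language tower," here is a minimal, self-contained PyTorch sketch. The class names, shapes, and the prepend-then-drop injection scheme are illustrative assumptions, not the actual nanoVLM implementation; they only show how gradients can reach a trainable vision tower while every language-tower parameter stays frozen.

```python
import torch
import torch.nn as nn


class ToyVisionTower(nn.Module):
    """Produces one context tensor per language-tower layer from image features (hypothetical)."""

    def __init__(self, vision_dim, hidden_dim, num_lm_layers, num_ctx_tokens=4):
        super().__init__()
        self.num_ctx_tokens = num_ctx_tokens
        # One trainable projection head per language-model layer -> per-layer contexts.
        self.heads = nn.ModuleList(
            [nn.Linear(vision_dim, hidden_dim) for _ in range(num_lm_layers)]
        )

    def forward(self, image_feats):  # image_feats: (batch, patches, vision_dim)
        tokens = image_feats[:, : self.num_ctx_tokens, :]   # keep a few patch tokens
        return [head(tokens) for head in self.heads]         # list of (B, T_ctx, hidden)


class ToyLanguageTower(nn.Module):
    """Frozen stack of transformer layers; a vision context is prepended at each layer."""

    def __init__(self, hidden_dim, num_layers, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )
        for p in self.parameters():          # freeze the whole language tower
            p.requires_grad_(False)

    def forward(self, hidden, contexts):     # hidden: (B, T, H); contexts: list of (B, T_ctx, H)
        for layer, ctx in zip(self.layers, contexts):
            T = hidden.size(1)
            # Prepend this layer's vision context, run the layer, then drop the context slots.
            hidden = layer(torch.cat([ctx, hidden], dim=1))[:, -T:, :]
        return hidden


# Tiny smoke test: gradients reach the vision tower even though the LM is frozen.
B, P, V, H, L = 2, 16, 32, 64, 3
vision_tower = ToyVisionTower(V, H, L)
language_tower = ToyLanguageTower(H, L)
image_feats = torch.randn(B, P, V)
text_hidden = torch.randn(B, 10, H)

out = language_tower(text_hidden, vision_tower(image_feats))
out.sum().backward()
print(any(p.grad is not None for p in vision_tower.parameters()))  # True
```

The sketch mirrors the training setup described above: only the vision-tower projections receive gradients, so the optimizer updates the vision side while the language tower acts as a fixed decoder.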