Twin-Tower VLM is a vision-language model built on a twin-tower architecture: a separate, trainable vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.
## Architecture
The twin-tower architecture consists of two components (sketched below):
- Vision Tower: processes images through a vision encoder → modality projector → decoder layers to produce per-layer contexts
- Language Tower: a frozen language model that receives the vision contexts and generates text
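A minimal PyTorch sketch of this flow is shown below. The class names (`VisionTowerSketch`, `LanguageTowerSketch`), the plain linear layers, the hidden sizes, and the prepend-then-drop context injection are all illustrative assumptions rather than the repository's actual implementation; the sketch only shows how a vision tower can hand one context tensor to each layer of a language tower.

```python
import torch
import torch.nn as nn


class VisionTowerSketch(nn.Module):
    """Encodes image features and emits one context tensor per language-model layer."""

    def __init__(self, vision_dim: int, lm_dim: int, num_lm_layers: int, num_ctx_tokens: int = 16):
        super().__init__()
        self.encoder = nn.Linear(vision_dim, vision_dim)    # stand-in for the SigLIP-style encoder
        self.projector = nn.Linear(vision_dim, lm_dim)      # modality projector into the LM hidden size
        self.layer_heads = nn.ModuleList(                   # one small head per language-model layer
            [nn.Linear(lm_dim, lm_dim) for _ in range(num_lm_layers)]
        )
        self.num_ctx_tokens = num_ctx_tokens

    def forward(self, image_feats: torch.Tensor) -> list[torch.Tensor]:
        # image_feats: (batch, num_patches, vision_dim), with num_patches >= num_ctx_tokens
        h = self.projector(torch.relu(self.encoder(image_feats)))
        return [head(h)[:, : self.num_ctx_tokens] for head in self.layer_heads]


class LanguageTowerSketch(nn.Module):
    """Frozen decoder stack that consumes one vision context at every layer."""

    def __init__(self, lm_dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(lm_dim, lm_dim) for _ in range(num_layers)])

    def forward(self, text_embeds: torch.Tensor, contexts: list[torch.Tensor]) -> torch.Tensor:
        h = text_embeds
        for layer, ctx in zip(self.layers, contexts):
            # Prepend this layer's vision context, run the layer, keep only the text positions.
            h = layer(torch.cat([ctx, h], dim=1))[:, ctx.size(1):]
        return h


vision = VisionTowerSketch(vision_dim=768, lm_dim=576, num_lm_layers=4)
language = LanguageTowerSketch(lm_dim=576, num_layers=4)

image_feats = torch.randn(1, 64, 768)   # fake image patch features
text_embeds = torch.randn(1, 10, 576)   # fake text token embeddings
hidden = language(text_embeds, vision(image_feats))
print(hidden.shape)                     # torch.Size([1, 10, 576])
```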
## Key Features
- Twin-Tower Design: vision and language are processed in separate towers and joined through per-layer context integration
- Frozen Language Tower: the language model's parameters are frozen; gradients reach the vision tower only through the injected contexts (see the training sketch after this list)
- Per-Layer Contexts: the vision tower generates a context for each language model layer
- Efficient Training: only the vision tower components are trainable
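The training setup implied by these features can be sketched with stand-in towers. The `nn.Sequential` stacks, tensor shapes, learning rate, and loss below are placeholders chosen for illustration, not the real modules; the point is that freezing the language tower's parameters still leaves the graph differentiable with respect to the contexts the vision tower produces.

```python
import torch
import torch.nn as nn

lm_dim = 576
vision_tower = nn.Sequential(nn.Linear(768, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))      # trainable
language_tower = nn.Sequential(nn.Linear(lm_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))  # to be frozen

# Freeze the language tower; only vision-tower parameters are optimized.
language_tower.requires_grad_(False)
language_tower.eval()
optimizer = torch.optim.AdamW(vision_tower.parameters(), lr=1e-4)

image_feats = torch.randn(2, 64, 768)
contexts = vision_tower(image_feats)    # differentiable w.r.t. vision-tower weights
output = language_tower(contexts)       # frozen weights, but the graph still reaches `contexts`
loss = output.pow(2).mean()             # placeholder loss
loss.backward()
optimizer.step()

# Gradients land only in the vision tower.
assert all(p.grad is not None for p in vision_tower.parameters())
assert all(p.grad is None for p in language_tower.parameters())
```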
## Usage
```python
from PIL import Image

from twin_tower import VisionLanguageTwinTowerModel
from config import VLMConfig

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
## Model Details
- Base Model: patrickamadeus/nanoVLM-twin-kv-cache-sanity-1500
- Architecture: Twin-Tower VLM
- Vision Encoder: SigLIP-based
- Language Model: SmolLM2-based
- Parameters: ~230M total (vision tower trainable, language tower frozen; see the check below)
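To confirm the reported split on a loaded checkpoint, the parameters can be counted directly. This assumes the `model` object from the Usage section is a standard `torch.nn.Module`, which is an assumption about the twin_tower code rather than something stated on this card.

```python
# Count total vs. trainable parameters (assumes `model` from the Usage example
# above is a torch.nn.Module with the language tower already frozen).
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e6:.1f}M | trainable: {trainable / 1e6:.1f}M | frozen: {(total - trainable) / 1e6:.1f}M")
```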
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.