---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
- twin-tower
---

**Twin-Tower VLM** is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.

## Architecture

The twin-tower architecture consists of:

1. **Vision Tower**: Processes images through a vision encoder → modality projector → decoder layers to produce per-layer contexts
2. **Language Tower**: A frozen language model that receives the vision contexts and generates text

## Key Features

- **Twin-Tower Design**: Vision and language are processed separately and integrated through per-layer contexts
- **Frozen Language Tower**: The language model parameters are frozen; gradients flow only through the vision contexts
- **Per-Layer Contexts**: The vision tower produces one context per language model layer
- **Efficient Training**: Only the vision tower components are trainable

## Usage

```python
# Requires the twin-tower nanoVLM code (e.g. twin_tower.py and config.py) to be importable.
from PIL import Image

from twin_tower import VisionLanguageTwinTowerModel
from config import VLMConfig

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```

## Model Details

- **Base Model**: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-3000
- **Architecture**: Twin-Tower VLM
- **Vision Encoder**: SigLIP-based
- **Language Model**: SmolLM2-based
- **Parameters**: ~230M total (vision tower trainable, language tower frozen)

For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.
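
## Conceptual Sketch of Per-Layer Context Injection

To illustrate the idea of "per-layer contexts with a frozen language tower," here is a minimal, self-contained PyTorch sketch. The class names, shapes, and the prepend-then-drop injection scheme are illustrative assumptions, not the actual nanoVLM implementation; they only show how gradients can reach a trainable vision tower while every language-tower parameter stays frozen.

```python
import torch
import torch.nn as nn


class ToyVisionTower(nn.Module):
    """Produces one context tensor per language-tower layer from image features (hypothetical)."""

    def __init__(self, vision_dim, hidden_dim, num_lm_layers, num_ctx_tokens=4):
        super().__init__()
        self.num_ctx_tokens = num_ctx_tokens
        # One trainable projection head per language-model layer -> per-layer contexts.
        self.heads = nn.ModuleList(
            [nn.Linear(vision_dim, hidden_dim) for _ in range(num_lm_layers)]
        )

    def forward(self, image_feats):  # image_feats: (batch, patches, vision_dim)
        tokens = image_feats[:, : self.num_ctx_tokens, :]   # keep a few patch tokens
        return [head(tokens) for head in self.heads]         # list of (B, T_ctx, hidden)


class ToyLanguageTower(nn.Module):
    """Frozen stack of transformer layers; a vision context is prepended at each layer."""

    def __init__(self, hidden_dim, num_layers, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )
        for p in self.parameters():          # freeze the whole language tower
            p.requires_grad_(False)

    def forward(self, hidden, contexts):     # hidden: (B, T, H); contexts: list of (B, T_ctx, H)
        for layer, ctx in zip(self.layers, contexts):
            T = hidden.size(1)
            # Prepend this layer's vision context, run the layer, then drop the context slots.
            hidden = layer(torch.cat([ctx, hidden], dim=1))[:, -T:, :]
        return hidden


# Tiny smoke test: gradients reach the vision tower even though the LM is frozen.
B, P, V, H, L = 2, 16, 32, 64, 3
vision_tower = ToyVisionTower(V, H, L)
language_tower = ToyLanguageTower(H, L)
image_feats = torch.randn(B, P, V)
text_hidden = torch.randn(B, 10, H)

out = language_tower(text_hidden, vision_tower(image_feats))
out.sum().backward()
print(any(p.grad is not None for p in vision_tower.parameters()))  # True
```

The sketch mirrors the training setup described above: only the vision-tower projections receive gradients, so the optimizer updates the vision side while the language tower acts as a fixed decoder.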