Twin-Tower VLM is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then integrated into a frozen language tower for text generation.

Architecture

The twin-tower architecture consists of:

  1. Vision Tower: Processes images through vision encoder → modality projector → decoder layers to create per-layer contexts
  2. Language Tower: Frozen language model that receives the vision contexts and generates text (a minimal sketch of this flow follows the list)
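
The sketch below illustrates this flow in PyTorch. It is a minimal illustration, not the repository's actual code: the class and attribute names (VisionTower, TwinTowerVLM, per_layer_contexts, and so on) are assumptions, and the exact mechanism for injecting contexts into the frozen language tower depends on the implementation.

import torch.nn as nn

class VisionTower(nn.Module):
    def __init__(self, vision_encoder, modality_projector, decoder_layers):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a SigLIP-style backbone
        self.modality_projector = modality_projector  # maps vision features to LM width
        self.decoder_layers = decoder_layers          # nn.ModuleList, one per LM layer

    def forward(self, pixel_values):
        hidden = self.modality_projector(self.vision_encoder(pixel_values))
        contexts = []
        for layer in self.decoder_layers:
            hidden = layer(hidden)
            contexts.append(hidden)                   # one context per language-tower layer
        return contexts

class TwinTowerVLM(nn.Module):
    def __init__(self, vision_tower, language_tower):
        super().__init__()
        self.vision_tower = vision_tower
        self.language_tower = language_tower          # frozen, SmolLM2-based LM

    def forward(self, pixel_values, input_ids):
        contexts = self.vision_tower(pixel_values)
        # How the contexts are injected (cross-attention, prefix tokens, ...) is
        # implementation-specific; here they are passed as an extra argument.
        return self.language_tower(input_ids, per_layer_contexts=contexts)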

Key Features

  • Twin-Tower Design: Separate processing of vision and language with per-layer context integration
  • Frozen Language Tower: Language model parameters are frozen; gradients flow back through the vision contexts
  • Per-Layer Contexts: Vision tower generates contexts for each language model layer
  • Efficient Training: Only the vision tower components are trainable (see the training sketch after this list)
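
As a sketch of this training setup, assuming a standard PyTorch model with language_tower and vision_tower attributes (which are not necessarily the repository's attribute names), the frozen/trainable split looks like this:

import torch.nn as nn

def freeze_language_tower(model: nn.Module) -> None:
    # Freezing stops the language weights from updating, but gradients still
    # flow through the frozen layers back into the vision contexts.
    for param in model.language_tower.parameters():
        param.requires_grad = False

def trainable_parameters(model: nn.Module):
    # Only vision-tower parameters (encoder, projector, decoder layers) remain
    # trainable, so only they are handed to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# Hypothetical training setup:
# freeze_language_tower(model)
# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-4)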

Usage

from twin_tower import VisionLanguageTwinTowerModel
from config import VLMConfig
from PIL import Image

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)

Model Details

  • Base Model: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-1500
  • Architecture: Twin-Tower VLM
  • Vision Encoder: SigLIP-based
  • Language Model: SmolLM2-based
  • Parameters: ~230M total (vision tower trainable, language tower frozen; see the quick check below)
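
A quick sanity check of this split, using the model loaded in the Usage section and assuming it exposes standard PyTorch parameters (a sketch, not repository code):

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable / 1e6:.1f}M / total: {total / 1e6:.1f}M")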

For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.
