Twin-Tower VLM is a vision-language model built on a twin-tower architecture: a separate, trainable vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.
## Architecture
The twin-tower architecture consists of two components (sketched below):
- Vision Tower: processes images through a vision encoder → modality projector → decoder layers to produce per-layer contexts
- Language Tower: a frozen language model that receives the vision contexts and generates text
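A minimal PyTorch sketch of this flow is shown below. The class names (`VisionTowerSketch`, `LanguageTowerSketch`), the plain linear layers, the hidden sizes, and the prepend-then-drop context injection are all illustrative assumptions rather than the repository's actual implementation; the sketch only shows how a vision tower can hand one context tensor to each layer of a language tower.

```python
import torch
import torch.nn as nn


class VisionTowerSketch(nn.Module):
    """Encodes image features and emits one context tensor per language-model layer."""

    def __init__(self, vision_dim: int, lm_dim: int, num_lm_layers: int, num_ctx_tokens: int = 16):
        super().__init__()
        self.encoder = nn.Linear(vision_dim, vision_dim)    # stand-in for the SigLIP-style encoder
        self.projector = nn.Linear(vision_dim, lm_dim)      # modality projector into the LM hidden size
        self.layer_heads = nn.ModuleList(                   # one small head per language-model layer
            [nn.Linear(lm_dim, lm_dim) for _ in range(num_lm_layers)]
        )
        self.num_ctx_tokens = num_ctx_tokens

    def forward(self, image_feats: torch.Tensor) -> list[torch.Tensor]:
        # image_feats: (batch, num_patches, vision_dim), with num_patches >= num_ctx_tokens
        h = self.projector(torch.relu(self.encoder(image_feats)))
        return [head(h)[:, : self.num_ctx_tokens] for head in self.layer_heads]


class LanguageTowerSketch(nn.Module):
    """Frozen decoder stack that consumes one vision context at every layer."""

    def __init__(self, lm_dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(lm_dim, lm_dim) for _ in range(num_layers)])

    def forward(self, text_embeds: torch.Tensor, contexts: list[torch.Tensor]) -> torch.Tensor:
        h = text_embeds
        for layer, ctx in zip(self.layers, contexts):
            # Prepend this layer's vision context, run the layer, keep only the text positions.
            h = layer(torch.cat([ctx, h], dim=1))[:, ctx.size(1):]
        return h


vision = VisionTowerSketch(vision_dim=768, lm_dim=576, num_lm_layers=4)
language = LanguageTowerSketch(lm_dim=576, num_layers=4)

image_feats = torch.randn(1, 64, 768)   # fake image patch features
text_embeds = torch.randn(1, 10, 576)   # fake text token embeddings
hidden = language(text_embeds, vision(image_feats))
print(hidden.shape)                     # torch.Size([1, 10, 576])
```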
## Key Features
- Twin-Tower Design: vision and language are processed in separate towers and joined through per-layer context integration
- Frozen Language Tower: the language model's parameters are frozen; gradients reach the vision tower only through the injected contexts (see the training sketch after this list)
- Per-Layer Contexts: the vision tower generates a context for each language model layer
- Efficient Training: only the vision tower components are trainable
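The training setup implied by these features can be sketched with stand-in towers. The `nn.Sequential` stacks, tensor shapes, learning rate, and loss below are placeholders chosen for illustration, not the real modules; the point is that freezing the language tower's parameters still leaves the graph differentiable with respect to the contexts the vision tower produces.

```python
import torch
import torch.nn as nn

lm_dim = 576
vision_tower = nn.Sequential(nn.Linear(768, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))      # trainable
language_tower = nn.Sequential(nn.Linear(lm_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))  # to be frozen

# Freeze the language tower; only vision-tower parameters are optimized.
language_tower.requires_grad_(False)
language_tower.eval()
optimizer = torch.optim.AdamW(vision_tower.parameters(), lr=1e-4)

image_feats = torch.randn(2, 64, 768)
contexts = vision_tower(image_feats)    # differentiable w.r.t. vision-tower weights
output = language_tower(contexts)       # frozen weights, but the graph still reaches `contexts`
loss = output.pow(2).mean()             # placeholder loss
loss.backward()
optimizer.step()

# Gradients land only in the vision tower.
assert all(p.grad is not None for p in vision_tower.parameters())
assert all(p.grad is None for p in language_tower.parameters())
```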
## Usage
```python
from PIL import Image

from twin_tower import VisionLanguageTwinTowerModel
from config import VLMConfig

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
## Model Details
- Base Model: patrickamadeus/nanoVLM-twin-kv-cache-sanity-1500
- Architecture: Twin-Tower VLM
- Vision Encoder: SigLIP-based
- Language Model: SmolLM2-based
- Parameters: ~230M total (vision tower trainable, language tower frozen; see the check below)
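To confirm the reported split on a loaded checkpoint, the parameters can be counted directly. This assumes the `model` object from the Usage section is a standard `torch.nn.Module`, which is an assumption about the twin_tower code rather than something stated on this card.

```python
# Count total vs. trainable parameters (assumes `model` from the Usage example
# above is a torch.nn.Module with the language tower already frozen).
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e6:.1f}M | trainable: {trainable / 1e6:.1f}M | frozen: {(total - trainable) / 1e6:.1f}M")
```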
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.