[Twin Tower] twin-230M-8k-coco-maxxing
Twin-Tower VLM is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then integrated into a frozen language tower for text generation.

The twin-tower architecture consists of:
- a vision tower that encodes the input image and emits one context per language-tower layer, and
- a frozen language tower whose layers consume those contexts while generating text (see the sketch below).
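Since this per-layer wiring is the defining feature of the design, here is a minimal, self-contained sketch of the data flow. Everything in it (module names, dimensions, the use of cross-attention) is an illustrative assumption, not the actual implementation in `twin_tower.py`:

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for the sketch (assumptions, not the real config).
D, N_LAYERS = 64, 4

class ToyVisionTower(nn.Module):
    """Maps image patches to one context tensor per language-tower layer."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, D)  # flattened 16x16 RGB patches
        self.heads = nn.ModuleList(nn.Linear(D, D) for _ in range(N_LAYERS))

    def forward(self, patches):                  # patches: (B, P, 768)
        h = self.patch_embed(patches)            # (B, P, D)
        return [head(h) for head in self.heads]  # N_LAYERS per-layer contexts

class ToyLanguageTower(nn.Module):
    """A decoder whose blocks can attend to an external per-layer context."""
    def __init__(self, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, D)
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(D, nhead=4, batch_first=True)
            for _ in range(N_LAYERS)
        )
        self.lm_head = nn.Linear(D, vocab)

def twin_tower_forward(vision, language, patches, input_ids):
    contexts = vision(patches)                   # one visual context per layer
    h = language.embed(input_ids)
    for block, ctx in zip(language.blocks, contexts):
        h = block(tgt=h, memory=ctx)             # text attends to its layer's context
    return language.lm_head(h)                   # next-token logits

vision, language = ToyVisionTower(), ToyLanguageTower()
language.requires_grad_(False)                   # the language tower stays frozen
logits = twin_tower_forward(
    vision, language,
    torch.randn(1, 49, 768),                     # 49 dummy patch vectors
    torch.randint(0, 1000, (1, 5)),              # 5 dummy prompt tokens
)
print(logits.shape)                              # torch.Size([1, 5, 1000])
```

Feeding a visual context into every frozen block, rather than only at the input, is what lets the language weights stay untouched while each layer is still conditioned on the image.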
Example usage:

```python
from PIL import Image

from config import VLMConfig
from twin_tower import VisionLanguageTwinTowerModel

# Load the model with the default configuration
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.