view article Article LightOnOCR-1B: The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR By lightonai and 2 others • 7 days ago • 52
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation Paper • 2510.00515 • Published 29 days ago • 39
jina-embeddings-v4 Collection Universal Embeddings for Multimodal Multilingual Retrieval • 10 items • Updated Sep 2 • 2
Lost in Embeddings: Information Loss in Vision-Language Models Paper • 2509.11986 • Published Sep 15 • 27
Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation Paper • 2509.10058 • Published Sep 12 • 11
STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer Paper • 2508.10893 • Published Aug 14 • 31
MolmoAct: Action Reasoning Models that can Reason in Space Paper • 2508.07917 • Published Aug 11 • 43
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off Paper • 2508.04825 • Published Aug 6 • 57
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models Paper • 2506.15681 • Published Jun 18 • 39
Scaling Properties of Diffusion Models for Perceptual Tasks Paper • 2411.08034 • Published Nov 12, 2024 • 13
Cosmos-Tokenizer Collection A suite of image and video tokenizers • 13 items • Updated 9 days ago • 41
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper • 2410.02073 • Published Oct 2, 2024 • 41
view article Article Welcome FalconMamba: The first strong attention-free 7B model Aug 12, 2024 • 113
MobileNetV4 pretrained weights Collection Weights for MobileNet-V4 pretrained in timm • 17 items • Updated Sep 19 • 20
DiTFastAttn: Attention Compression for Diffusion Transformer Models Paper • 2406.08552 • Published Jun 12, 2024 • 25
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations Paper • 2405.18392 • Published May 28, 2024 • 12