SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding Paper • 2408.14764 • Published Aug 27, 2024
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18, 2024
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy Paper • 2412.02210 • Published Dec 3, 2024
Revisiting Multimodal Positional Encoding in Vision-Language Models Paper • 2510.23095 • Published 9 days ago