-
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
Paper • 2412.06209 • Published -
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83 -
Optical Music Recognition of Jazz Lead Sheets
Paper • 2509.05329 • Published -
Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music
Paper • 2405.12105 • Published
Collections
Discover the best community collections!
Collections including paper arxiv:2409.01704
-
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 74 -
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 19 -
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 47 -
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 121
-
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 87 -
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83 -
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Paper • 2408.06195 • Published • 73 -
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance
Paper • 2405.06682 • Published • 3
-
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
Paper • 2412.06209 • Published -
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83 -
Optical Music Recognition of Jazz Lead Sheets
Paper • 2509.05329 • Published -
Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music
Paper • 2405.12105 • Published
-
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 74 -
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 19 -
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 47 -
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 121
-
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 87 -
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83 -
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Paper • 2408.06195 • Published • 73 -
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance
Paper • 2405.06682 • Published • 3