- 
	
	
	
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 28 - 
	
	
	
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 - 
	
	
	
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 - 
	
	
	
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23 
Collections
Discover the best community collections!
Collections including paper arxiv:2503.18942 
						
					
				- 
	
	
	
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper • 2503.12605 • Published • 35 - 
	
	
	
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Paper • 2503.12937 • Published • 30 - 
	
	
	
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Paper • 2503.12271 • Published • 9 - 
	
	
	
Video-T1: Test-Time Scaling for Video Generation
Paper • 2503.18942 • Published • 90 
- 
	
	
	
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Paper • 2412.11100 • Published • 7 - 
	
	
	
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Paper • 2412.09856 • Published • 10 - 
	
	
	
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
Paper • 2412.09349 • Published • 8 - 
	
	
	
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Paper • 2412.04448 • Published • 10 
- 
	
	
	
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper • 2408.11796 • Published • 57 - 
	
	
	
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 52 - 
	
	
	
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 44 - 
	
	
	
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 63 
- 
	
	
	
One-Minute Video Generation with Test-Time Training
Paper • 2504.05298 • Published • 110 - 
	
	
	
MoCha: Towards Movie-Grade Talking Character Synthesis
Paper • 2503.23307 • Published • 138 - 
	
	
	
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 158 - 
	
	
	
Antidistillation Sampling
Paper • 2504.13146 • Published • 59 
- 
	
	
	
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 31 - 
	
	
	
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107 - 
	
	
	
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 26 - 
	
	
	
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 104 
- 
	
	
	
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Paper • 2410.10306 • Published • 57 - 
	
	
	
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Paper • 2411.05003 • Published • 71 - 
	
	
	
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Paper • 2411.04709 • Published • 26 - 
	
	
	
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
Paper • 2410.07171 • Published • 43 
- 
	
	
	
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
Paper • 2401.09985 • Published • 18 - 
	
	
	
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
Paper • 2401.09962 • Published • 9 - 
	
	
	
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
Paper • 2401.10404 • Published • 10 - 
	
	
	
ActAnywhere: Subject-Aware Video Background Generation
Paper • 2401.10822 • Published • 13 
- 
	
	
	
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 28 - 
	
	
	
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 - 
	
	
	
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 - 
	
	
	
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23 
- 
	
	
	
One-Minute Video Generation with Test-Time Training
Paper • 2504.05298 • Published • 110 - 
	
	
	
MoCha: Towards Movie-Grade Talking Character Synthesis
Paper • 2503.23307 • Published • 138 - 
	
	
	
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 158 - 
	
	
	
Antidistillation Sampling
Paper • 2504.13146 • Published • 59 
- 
	
	
	
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper • 2503.12605 • Published • 35 - 
	
	
	
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Paper • 2503.12937 • Published • 30 - 
	
	
	
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Paper • 2503.12271 • Published • 9 - 
	
	
	
Video-T1: Test-Time Scaling for Video Generation
Paper • 2503.18942 • Published • 90 
- 
	
	
	
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 31 - 
	
	
	
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107 - 
	
	
	
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 26 - 
	
	
	
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 104 
- 
	
	
	
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Paper • 2412.11100 • Published • 7 - 
	
	
	
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Paper • 2412.09856 • Published • 10 - 
	
	
	
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
Paper • 2412.09349 • Published • 8 - 
	
	
	
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Paper • 2412.04448 • Published • 10 
- 
	
	
	
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Paper • 2410.10306 • Published • 57 - 
	
	
	
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Paper • 2411.05003 • Published • 71 - 
	
	
	
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Paper • 2411.04709 • Published • 26 - 
	
	
	
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
Paper • 2410.07171 • Published • 43 
- 
	
	
	
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper • 2408.11796 • Published • 57 - 
	
	
	
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 52 - 
	
	
	
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 44 - 
	
	
	
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 63 
- 
	
	
	
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
Paper • 2401.09985 • Published • 18 - 
	
	
	
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
Paper • 2401.09962 • Published • 9 - 
	
	
	
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
Paper • 2401.10404 • Published • 10 - 
	
	
	
ActAnywhere: Subject-Aware Video Background Generation
Paper • 2401.10822 • Published • 13