- iVideoGPT: Interactive VideoGPTs are Scalable World Models • Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models • Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling • Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models • Paper • 2405.17430 • Published • 34
Collections
Collections including paper arxiv:2506.23044
- OmniGen2: Exploration to Advanced Multimodal Generation • Paper • 2506.18871 • Published • 77
- OmniGen: Unified Image Generation • Paper • 2409.11340 • Published • 115
- Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation • Paper • 2502.05415 • Published • 22
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation • Paper • 2408.12528 • Published • 51
- yandex/stable-diffusion-3.5-medium-alchemist • Text-to-Image • Updated • 76 • 6
- Ovis-U1 Technical Report • Paper • 2506.23044 • Published • 62
- FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model • Paper • 2507.01953 • Published • 19
- LongAnimation: Long Animation Generation with Dynamic Global-Local Memory • Paper • 2507.01945 • Published • 78
- Can Large Language Models Understand Context? • Paper • 2402.00858 • Published • 23
- OLMo: Accelerating the Science of Language Models • Paper • 2402.00838 • Published • 84
- Self-Rewarding Language Models • Paper • 2401.10020 • Published • 151
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity • Paper • 2401.17072 • Published • 25
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters • Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models • Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding • Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss • Paper • 2402.05008 • Published • 23
- Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities • Paper • 2505.02567 • Published • 80
- OmniGen2: Exploration to Advanced Multimodal Generation • Paper • 2506.18871 • Published • 77
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation • Paper • 2506.17202 • Published • 10
- ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation • Paper • 2506.18095 • Published • 66
- DocLLM: A layout-aware generative language model for multimodal document understanding • Paper • 2401.00908 • Published • 189
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training • Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents • Paper • 2311.05437 • Published • 51
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing • Paper • 2311.00571 • Published • 43