L-Hongbin's Collections

MutiModal_Paper
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation • Paper 2410.13861 • 56 upvotes
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation • Paper 2411.07975 • 30 upvotes
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization • Paper 2411.10442 • 87 upvotes
- Multimodal Autoregressive Pre-training of Large Vision Encoders • Paper 2411.14402 • 46 upvotes
- DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding • Paper 2411.14347 • 15 upvotes
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models • Paper 2411.14982 • 19 upvotes
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction • Paper 2411.14762 • 11 upvotes
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives • Paper 2411.02545 • 1 upvote
- Hymba: A Hybrid-head Architecture for Small Language Models • Paper 2411.13676 • 45 upvotes
- SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory • Paper 2411.11922 • 19 upvotes
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent • Paper 2411.17465 • 88 upvotes
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration • Paper 2411.17686 • 20 upvotes
- DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting • Paper 2411.17223 • 7 upvotes
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity • Paper 2411.15411 • 8 upvotes
- GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI • Paper 2411.14522 • 39 upvotes
- Knowledge Transfer Across Modalities with Natural Language Supervision • Paper 2411.15611 • 17 upvotes
- ChatRex: Taming Multimodal LLM for Joint Perception and Understanding • Paper 2411.18363 • 10 upvotes
- EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality • Paper 2411.15241 • 7 upvotes
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient • Paper 2411.17787 • 12 upvotes
- On Domain-Specific Post-Training for Multimodal Large Language Models • Paper 2411.19930 • 29 upvotes
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos • Paper 2409.19603 • 19 upvotes
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding • Paper 2406.19389 • 54 upvotes
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning • Paper 2412.03248 • 27 upvotes
- CompCap: Improving Multimodal Large Language Models with Composite Captions • Paper 2412.05243 • 20 upvotes
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion • Paper 2412.04424 • 63 upvotes
- POINTS1.5: Building a Vision-Language Model towards Real World Applications • Paper 2412.08443 • 38 upvotes
- Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • Paper 2412.08737 • 54 upvotes
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding • Paper 2412.09604 • 38 upvotes
- Learned Compression for Compressed Learning • Paper 2412.09405 • 13 upvotes
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer • Paper 2412.13871 • 18 upvotes
- AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities • Paper 2412.14123 • 11 upvotes
- FastVLM: Efficient Vision Encoding for Vision Language Models • Paper 2412.13303 • 70 upvotes
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models • Paper 2412.05939 • 16 upvotes
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition • Paper 2412.04429
- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models • Paper 2501.05767 • 29 upvotes
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation • Paper 2502.05178 • 10 upvotes
- VideoRoPE: What Makes for Good Video Rotary Position Embedding? • Paper 2502.05173 • 65 upvotes
- Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More • Paper 2502.03738 • 11 upvotes
- InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model • Paper 2501.12368 • 45 upvotes
- Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning • Paper 2503.18013 • 20 upvotes
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos • Paper 2501.09781 • 28 upvotes
- Where do Large Vision-Language Models Look at when Answering Questions? • Paper 2503.13891 • 8 upvotes
- Seedream 3.0 Technical Report • Paper 2504.11346 • 70 upvotes
- RL makes MLLMs see better than SFT • Paper 2510.16333 • 46 upvotes