kevin1020's Collections
				
 - Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
   Paper • 2403.12596 • Published • 11
 - Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
   Paper • 2404.13013 • Published • 31
 - PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
   Paper • 2404.16994 • Published • 36
 - AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
   Paper • 2405.14129 • Published • 14
 - Dense Connector for MLLMs
   Paper • 2405.13800 • Published • 24
 - Merlin:Empowering Multimodal LLMs with Foresight Minds
   Paper • 2312.00589 • Published • 27
 - LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
   Paper • 2407.15754 • Published • 20
 - SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
   Paper • 2407.15841 • Published • 40
 - Efficient Inference of Vision Instruction-Following Models with Elastic Cache
   Paper • 2407.18121 • Published • 17
 - VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
   Paper • 2409.01071 • Published • 27
 - LongVLM: Efficient Long Video Understanding via Large Language Models
   Paper • 2404.03384 • Published
 - Visual Context Window Extension: A New Perspective for Long Video Understanding
   Paper • 2409.20018 • Published • 11
 - VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
   Paper • 2410.10594 • Published • 29
 - VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
   Paper • 2501.13106 • Published • 90