VideoNSA: Native Sparse Attention Scales Video Understanding Paper • 2510.02295 • Published 29 days ago • 9
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper • 2508.18265 • Published Aug 25 • 202
Qwen2.5-VL Collection Vision-language model series based on Qwen2.5 • 11 items • Updated Jul 21 • 544
STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing Paper • 2506.22868 • Published Jun 28 • 5
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective Paper • 2505.15045 • Published May 21 • 54
Unofficial Mamba2 for Hf Transformers Collection Just the original weights converted to be compatible with transformers. • 5 items • Updated Oct 16, 2024 • 1
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models Paper • 2502.02492 • Published Feb 4 • 66
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach Paper • 2502.03639 • Published Feb 5 • 9