Collections including paper arxiv:2501.04003

- Multi-Agent Autonomous Driving Systems with Large Language Models: A Survey of Recent Advances
  Paper • 2502.16804 • Published
- AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
  Paper • 2503.07608 • Published • 23
- V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models
  Paper • 2502.09980 • Published • 5
- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
  Paper • 2503.05689 • Published • 3

- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
  Paper • 2412.19723 • Published • 87
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Paper • 2412.04467 • Published • 118
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  Paper • 2412.03555 • Published • 133

- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
  Paper • 2501.01257 • Published • 52
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
  Paper • 2501.01423 • Published • 44
- REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents
  Paper • 2411.13552 • Published

- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning
  Paper • 2404.12803 • Published • 30
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  Paper • 2404.13013 • Published • 31
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
  Paper • 2404.06512 • Published • 30

- Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
  Paper • 2403.09029 • Published • 55
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
  Paper • 2403.12968 • Published • 25
- RAFT: Adapting Language Model to Domain Specific RAG
  Paper • 2403.10131 • Published • 72
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
  Paper • 2403.09629 • Published • 78

- Cosmos World Foundation Model Platform for Physical AI
  Paper • 2501.03575 • Published • 81
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
  Paper • 2501.00599 • Published • 47
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
  Paper • 2501.03841 • Published • 56
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
  Paper • 2501.04003 • Published • 27

- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 52
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 47
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
  Paper • 2501.04003 • Published • 27
- VideoRAG: Retrieval-Augmented Generation over Video Corpus
  Paper • 2501.05874 • Published • 75

- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 34

- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 241
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 37
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 39