Oliver Wei
Oliver2021
AI & ML interests: None yet

Recent Activity
- liked a model 5 days ago: pyannote/speaker-diarization-3.1
- liked a model 5 days ago: openai/whisper-large-v3
- liked a dataset 8 days ago: vyokky/GUI-360

Organizations: None yet
Image-gen
MLLM
LLM understanding
MM-EVAL
- MMRA: A Benchmark for Multi-granularity Multi-image Relational Association (Paper • 2407.17379 • Published • 3)
- MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines (Paper • 2409.12959 • Published • 38)
- MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks (Paper • 2505.16459 • Published • 45)
- VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? (Paper • 2505.23359 • Published • 39)
MMLM
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (Paper • 2404.13013 • Published • 31)
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing (Paper • 2404.12253 • Published • 55)
- Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity (Paper • 2403.12267 • Published)
- No More Adam: Learning Rate Scaling at Initialization is All You Need (Paper • 2412.11768 • Published • 43)
Video-gen
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction (Paper • 2503.19325 • Published • 73)
- Seedance 1.0: Exploring the Boundaries of Video Generation Models (Paper • 2506.09113 • Published • 102)
- Discrete Diffusion in Large Language and Multimodal Models: A Survey (Paper • 2506.13759 • Published • 43)
- Video models are zero-shot learners and reasoners (Paper • 2509.20328 • Published • 96)
Agent
- TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools (Paper • 2503.10970 • Published • 18)
- Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems (Paper • 2504.01990 • Published • 300)
- Grounding Computer Use Agents on Human Demonstrations (Paper • 2511.07332 • Published • 99)
Long context
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU (Paper • 2502.08910 • Published • 148)
- Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity (Paper • 2502.13063 • Published • 72)
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (Paper • 2502.11089 • Published • 165)
- LLM Pretraining with Continuous Concepts (Paper • 2502.08524 • Published • 29)
RAG
reasoning
- URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics (Paper • 2501.04686 • Published • 53)
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models (Paper • 2501.09686 • Published • 41)
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step (Paper • 2411.10440 • Published • 130)
- TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding (Paper • 2502.19400 • Published • 48)
VLA