Collections including paper arxiv:2409.08264

- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (arXiv:2412.04454)
- Tree Search for Language Model Agents (arXiv:2407.01476)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (arXiv:2401.10935)
- OmniParser for Pure Vision Based GUI Agent (arXiv:2408.00203)

- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
- WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv:2307.13854)
- Mind2Web: Towards a Generalist Agent for the Web (arXiv:2306.06070)
- Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation (arXiv:2410.13232)

- Law of Vision Representation in MLLMs (arXiv:2408.16357)
- CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500)
- Learning to Move Like Professional Counter-Strike Players (arXiv:2408.13934)
- Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637)

- Automated Design of Agentic Systems (arXiv:2408.08435)
- On the limits of agency in agent-based models (arXiv:2409.10568)
- On the Diagram of Thought (arXiv:2409.10038)
- DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (arXiv:2409.07703)

- End-to-End Goal-Driven Web Navigation (arXiv:1602.02261)
- Learning Language Games through Interaction (arXiv:1606.02447)
- Naturalizing a Programming Language via Interactive Learning (arXiv:1704.06956)
- Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration (arXiv:1802.08802)

- Automated Design of Agentic Systems (arXiv:2408.08435)
- Self-Refine: Iterative Refinement with Self-Feedback (arXiv:2303.17651)
- Automating Thought of Search: A Journey Towards Soundness and Completeness (arXiv:2408.11326)
- Building Math Agents with Multi-Turn Iterative Preference Learning (arXiv:2409.02392)

- Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation (arXiv:2408.11812)
- WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv:2307.13854)
- Agent Workflow Memory (arXiv:2409.07429)
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (arXiv:2409.08264)

- Qwen2.5-VL Technical Report (arXiv:2502.13923)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent (arXiv:2411.17465)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)

- Can large language models explore in-context? (arXiv:2403.15371)
- Advancing LLM Reasoning Generalists with Preference Trees (arXiv:2404.02078)
- Long-context LLMs Struggle with Long In-context Learning (arXiv:2404.02060)
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences (arXiv:2404.03715)