arXiv:2511.01163

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Published on Nov 3 · Submitted by Yongyuan Liang on Nov 4

Abstract

AI-generated summary

ROVER is a benchmark that evaluates reciprocal cross-modal reasoning in unified multimodal models, showing that cross-modal interactions significantly impact visual generation quality and that models struggle with symbolic reasoning tasks.

Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, so that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning: textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in pixels. We introduce ROVER to address this gap by testing reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, comprising 1,312 tasks grounded in 1,876 images and spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show a dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.

Community

Paper submitter

ROVER evaluates UMMs through reciprocal cross-modal reasoning: ROVER-IG requires generating images with verbally-augmented reasoning, while ROVER-TG requires generating text answers with visually-augmented reasoning; a sketch of the two loops follows below.
https://roverbench.github.io/
[Overview figure]
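
The two settings map naturally onto two scoring loops. Below is a minimal sketch of that structure, assuming a generic interleaved model; every name here (`RoverTask`, `reason_in_text`, `generate_image`, `answer_in_text`, the judge) is a hypothetical placeholder, not the benchmark's actual interface.

```python
# Hypothetical sketch of ROVER's two settings; all names below are
# placeholders for illustration, not the benchmark's real API.
from dataclasses import dataclass, field

@dataclass
class RoverTask:
    prompt: str                                   # verbal instruction or question
    images: list = field(default_factory=list)    # grounding images for the task
    answer: str | None = None                     # gold text answer (ROVER-TG only)

def eval_rover_ig(model, task, judge):
    """ROVER-IG: verbally-augmented reasoning for visual generation.
    The model reasons in text, then must synthesize an image faithful
    to both the prompt and its own reasoning chain."""
    chain = model.reason_in_text(task.prompt, task.images)   # placeholder call
    image = model.generate_image(task.prompt, chain)         # placeholder call
    return judge(image, task.prompt)                         # e.g. a human or VLM judge

def eval_rover_tg(model, task):
    """ROVER-TG: visually-augmented reasoning for verbal generation.
    The model may draw an intermediate visualization to support its
    own reasoning before committing to a text answer."""
    sketch = model.generate_image(task.prompt, task.images)  # placeholder call
    answer = model.answer_in_text(task.prompt, task.images + [sketch])
    return float(answer.strip() == task.answer)
```

The point of the split is that each loop stresses the opposite modality from the one being scored: IG scores pixels but depends on the text chain, TG scores text but depends on the generated sketch.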
