arxiv:2401.07803

Uncovering the Full Potential of Visual Grounding Methods in VQA

Published on Jan 15, 2024

Abstract

Current evaluation schemes for Visual Grounding methods in Visual Question Answering are flawed because they assume the visual input contains accurate, question-relevant information, which can lead to misleading assessments of these methods' effectiveness.

AI-generated summary

Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to improve VQA performance by strengthening a model's reliance on question-relevant visual information. The presence of such relevant information in the visual input is typically assumed in training and testing. This assumption, however, is inherently flawed when dealing with the imperfect image representations common in large-scale VQA, where the information carried by visual features frequently deviates from the expected ground-truth contents. As a result, training and testing of VG methods are performed with largely inaccurate data, which obstructs proper assessment of their potential benefits. In this study, we demonstrate that current evaluation schemes for VG methods are problematic due to the flawed assumption that relevant visual information is available. Our experiments show that these methods can be much more effective when evaluation conditions are corrected. Code is provided on GitHub.
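To make the evaluation issue concrete, here is a minimal sketch (not the authors' published protocol; all names, data structures, and the coverage threshold are hypothetical) of how one might partition a VQA evaluation set by whether the question-relevant objects are actually carried by a model's visual features, so that VG methods are assessed only where relevant visual information is available:

```python
# Hypothetical illustration: split an evaluation set into samples where the
# visual features do / do not contain the question-relevant objects.
from dataclasses import dataclass


@dataclass
class Sample:
    question_id: str
    relevant_objects: set[str]   # ground-truth objects the question refers to
    detected_objects: set[str]   # objects actually present in the visual features


def has_relevant_visual_info(sample: Sample, min_coverage: float = 1.0) -> bool:
    """True if the visual features cover enough of the question-relevant objects."""
    if not sample.relevant_objects:
        return False
    covered = sample.relevant_objects & sample.detected_objects
    return len(covered) / len(sample.relevant_objects) >= min_coverage


def split_eval_set(samples: list[Sample], min_coverage: float = 1.0):
    """Partition samples into those with and without relevant visual information."""
    grounded = [s for s in samples if has_relevant_visual_info(s, min_coverage)]
    ungrounded = [s for s in samples if not has_relevant_visual_info(s, min_coverage)]
    return grounded, ungrounded


if __name__ == "__main__":
    samples = [
        Sample("q1", {"dog", "frisbee"}, {"dog", "frisbee", "grass"}),
        Sample("q2", {"clock"}, {"wall", "window"}),
    ]
    grounded, ungrounded = split_eval_set(samples)
    print(len(grounded), "samples with relevant visual info,", len(ungrounded), "without")
```

Under such a split, a VG method's benefit can be measured on the subset where its assumption of available relevant information actually holds, rather than on the full, partially mislabeled evaluation data.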
