arxiv:2509.11866

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Published on Sep 15

· Submitted by

Authors:

Abstract

Dr.V, a hierarchical framework with Dr.V-Bench and Dr.V-Agent, addresses video hallucinations through fine-grained spatial-temporal grounding and cognitive reasoning, enhancing video understanding.

AI-generated summary

Recent advancements in large video models (LVMs) have significantly enhance video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises of two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.

View arXiv page View PDF Add to collection

Community

Eureka-Leo

Paper submitter Sep 16

librarian-bot

Sep 17

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.11866 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.11866 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.11866 in a Space README.md to link it from this page.