Papers
arxiv:2510.13251

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Published on Oct 15 · Submitted by Taekyung Kim on Oct 27

Abstract

Video Large Language Models (VideoLLMs) perform video question answering by initiating temporal reasoning through cross-frame interactions, then integrating video and language information, and finally generating answers through effective information pathways that remain intact even when many unnecessary attention edges are suppressed.

AI-generated summary

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, the internal mechanisms governing where and how they extract and propagate video and textual information remain underexplored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint for how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io.

Community

TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways.

Summary of our findings on VideoLLMs' information flow:

(a) Temporal reasoning begins with cross-frame interactions within video tokens at early-middle layers [green], followed by video-language integration into temporal keywords in the question [purple]. This information is conveyed to the last token at middle-late layers [orange], where answer generation occurs [yellow].

(b) These effective pathways are identified via Attention Knockout, which disconnects individual attention pairs and tracks the resulting drop in the probability of the final answer to quantify their impact (see the sketch after this list).

(c) Layer-wise answer probability rises immediately after video-language integration, indicating that the model is ready to predict correct answers after the middle layers.
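
To make (b) concrete, here is a minimal, self-contained PyTorch sketch of the knockout idea. It is not the paper's implementation: the token layout, dimensions, and the choice of knocked-out edges (last token → video tokens) are assumptions for illustration. In the real setup the mask is applied inside the VideoLLM's attention layers over a chosen layer window, and the quantity tracked is the drop in the probability of the correct answer token.

```python
# Minimal standalone sketch of Attention Knockout (not the paper's code).
# A boolean attention mask keeps the usual causal structure but disconnects
# chosen (query, key) pairs, e.g. the last token attending to video tokens.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical token layout: [video tokens][question tokens][last token]
n_video, n_question = 8, 4
seq_len = n_video + n_question + 1
video_positions = list(range(n_video))
last = seq_len - 1                       # the token that predicts the answer

n_heads, d_head = 2, 16
q = torch.randn(1, n_heads, seq_len, d_head)
k = torch.randn(1, n_heads, seq_len, d_head)
v = torch.randn(1, n_heads, seq_len, d_head)

# Standard causal mask: True = attention edge allowed.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Knockout: additionally disconnect last-token -> video-token edges.
knockout = causal.clone()
knockout[last, video_positions] = False

def attend(mask):
    # False entries in attn_mask are excluded from attention.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

out_full = attend(causal)
out_ko = attend(knockout)

# In the full pipeline this masked attention replaces a window of layers inside
# the VideoLLM, and one compares the probability of the correct answer token
# with and without the knockout. Here we just show that the last token's
# attention output changes once its video edges are removed.
delta = (out_full[0, :, last] - out_ko[0, :, last]).norm().item()
print(f"change in last-token attention output after knockout: {delta:.3f}")
```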

Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT.
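
As a companion to the sketch above, this toy snippet shows how "suppressing attention edges" can be counted: build a per-layer keep-mask that retains only the pathways summarized in (a) and measure what fraction of causal edges is dropped. All layer ranges and token counts here are invented for illustration; the 58% figure for LLaVA-NeXT-7B-Video-FT comes from the paper's own evaluation.

```python
# Toy edge-suppression mask (not the paper's code; all ranges are made up).
# Edges are (layer, query, key) triples under the causal mask; only the
# pathways summarized in (a) are kept, and the rest are counted as suppressed.
import torch

n_layers, n_video, n_question = 32, 8, 4
seq_len = n_video + n_question + 1
last = seq_len - 1
video = slice(0, n_video)
question = slice(n_video, n_video + n_question)

causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
keep = torch.zeros(n_layers, seq_len, seq_len, dtype=torch.bool)

keep[4:16, video, video] = True          # cross-frame interaction, early-to-middle layers
keep[12:22, question, video] = True      # video-language integration, middle layers
keep[16:30, last, question] = True       # flow into the last token, middle-to-late layers
keep[:, torch.arange(seq_len), torch.arange(seq_len)] = True  # keep self edges

keep &= causal                           # never add non-causal edges
total = causal.sum().item() * n_layers   # all causal edges across layers
suppressed = total - keep.sum().item()
print(f"suppressed {suppressed / total:.0%} of causal attention edges")
```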

[Figure: map-the-flow teaser]

