arxiv:2503.09158

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Published on Mar 12

Authors:

Abstract

FaVChat, a Video-MLLM, enhances fine-grained facial understanding through multi-level feature extraction and Date-Efficient GRPO, achieving superior accuracy and generalization with limited data.

AI-generated summary

Multi-modal large language models (MLLMs) have shown strong capability in video understanding but still struggle with fine-grained visual comprehension, as pure visual encoders often lose subtle cues essential for precise reasoning. To address this limitation, we propose FaVChat, a Video-MLLM specifically designed for fine-grained facial understanding. FaVChat introduces a multi-level prompt-guided feature extraction mechanism that progressively captures task-relevant information from three complementary stages: low-level transformer layers for textures and motion, medium-level learnable queries for discriminative regions, and high-level adaptive feature weighting for semantic alignment. These enriched features are dynamically fused and fed into the LLM to enable more accurate fine-grained reasoning. To further enhance the model's ability to capture fine-grained facial attributes and maximize the utility of limited data, we propose Date-Efficient GRPO, a novel data-efficient reinforcement learning (RL) algorithm that maximizes the utility of each training sample through per-instance utility estimation and dynamic lifecycle scheduling. Extensive zero-shot evaluations across emotion recognition, explainable reasoning, and textual expression analysis demonstrate that FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs, even when trained with only 10K RL samples.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2503.09158 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2503.09158 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2503.09158 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.