arXiv:2503.09158

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Published on Mar 12, 2025

AI-generated summary

FaVChat, a Video-MLLM, enhances fine-grained facial understanding through multi-level feature extraction and Data-Efficient GRPO, achieving superior accuracy and generalization with limited data.

Abstract

Multi-modal large language models (MLLMs) have shown strong capability in video understanding but still struggle with fine-grained visual comprehension, as pure visual encoders often lose subtle cues essential for precise reasoning. To address this limitation, we propose FaVChat, a Video-MLLM specifically designed for fine-grained facial understanding. FaVChat introduces a multi-level prompt-guided feature extraction mechanism that progressively captures task-relevant information from three complementary stages: low-level transformer layers for textures and motion, medium-level learnable queries for discriminative regions, and high-level adaptive feature weighting for semantic alignment. These enriched features are dynamically fused and fed into the LLM to enable more accurate fine-grained reasoning. To further enhance the model's ability to capture fine-grained facial attributes from limited data, we propose Data-Efficient GRPO, a reinforcement learning (RL) algorithm that maximizes the utility of each training sample through per-instance utility estimation and dynamic lifecycle scheduling. Extensive zero-shot evaluations across emotion recognition, explainable reasoning, and textual expression analysis demonstrate that FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs, even when trained with only 10K RL samples.
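
The abstract describes both components only at a high level. Purely as an illustration of how a three-stage prompt-guided extractor could be wired together, here is a minimal PyTorch sketch; every module name, dimension, and pooling choice below is an assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HierarchicalPromptGuidedExtractor(nn.Module):
    """Hypothetical sketch of FaVChat-style multi-level feature extraction.

    Three complementary stages, per the abstract:
      1. low-level transformer layers   -> textures and motion
      2. medium-level learnable queries -> discriminative facial regions
      3. high-level adaptive weighting  -> semantic alignment with the prompt
    """

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # Stage 2: learnable queries that cross-attend to visual tokens
        # to pick out discriminative regions (eyes, mouth, brows, ...).
        self.region_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 3: prompt-conditioned gate that re-weights feature channels.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # Fusion of the three streams before handing off to the LLM.
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, low_feats, high_feats, prompt_emb):
        """low_feats / high_feats: (B, N, dim) tokens from shallow / deep
        encoder layers; prompt_emb: (B, dim) pooled text-prompt embedding."""
        # Stage 1: shallow-layer tokens carry texture/motion cues; pool them.
        low = low_feats.mean(dim=1)                                  # (B, dim)
        # Stage 2: queries attend over deep tokens to extract regions.
        q = self.region_queries.unsqueeze(0).expand(high_feats.size(0), -1, -1)
        regions, _ = self.cross_attn(q, high_feats, high_feats)     # (B, Q, dim)
        mid = regions.mean(dim=1)                                    # (B, dim)
        # Stage 3: adaptively weight deep semantics by the prompt.
        high = self.gate(prompt_emb) * high_feats.mean(dim=1)        # (B, dim)
        # Dynamic fusion of all three streams for the LLM input.
        return self.fuse(torch.cat([low, mid, high], dim=-1))        # (B, dim)
```

Likewise, "per-instance utility estimation and dynamic lifecycle scheduling" suggests a sampler that prioritizes informative examples and retires exhausted ones. The sketch below is one plausible reading, using the spread of GRPO group advantages as the utility signal; both that signal and the retirement rule are assumptions, since the abstract names the mechanisms but not their form.

```python
import heapq

class UtilitySampler:
    """Hypothetical per-instance utility scheduler for a
    Data-Efficient GRPO training stage."""

    def __init__(self, sample_ids, max_visits: int = 4):
        # Optimistic initial utility so every sample is tried at least once.
        self.utility = {s: float("inf") for s in sample_ids}
        self.visits = {s: 0 for s in sample_ids}
        self.max_visits = max_visits  # lifecycle cap per sample

    def next_batch(self, k: int):
        live = [s for s in self.utility if self.visits[s] < self.max_visits]
        # Prefer samples whose last rollout group was most informative.
        return heapq.nlargest(k, live, key=lambda s: self.utility[s])

    def update(self, sample_id, group_advantages):
        # One plausible utility: the spread of GRPO group advantages; a
        # sample whose rollouts all score alike teaches the policy little.
        spread = max(group_advantages) - min(group_advantages)
        self.visits[sample_id] += 1
        self.utility[sample_id] = spread
        if spread == 0.0:  # retire samples that no longer carry signal
            self.visits[sample_id] = self.max_visits
```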
