ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
Abstract
ARC-Chapter is a large-scale video chaptering model trained on million-scale chapter annotations and evaluated with a new metric (GRACE); it sets a new state of the art in video chaptering and transfers well to downstream tasks.
The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training sets whose annotations are typically short and coarse, restricting generalization to the nuanced transitions of long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on over a million long-video chapter annotations that are bilingual, temporally grounded, and hierarchical. To this end, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene text, and visual captions into multi-level annotations, ranging from short titles to long summaries. We demonstrate clear performance improvements with scaling, both in data volume and label intensity. We also design a new evaluation metric, termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting the flexibility of real-world chaptering. Extensive experiments show that ARC-Chapter establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter transfers well, improving the state of the art on downstream tasks such as dense video captioning on YouCook2.
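The abstract does not give GRACE's exact formulation; the sketch below is only a rough illustration of how a chaptering metric that combines many-to-one temporal overlap with semantic title similarity could be scored. All names (`Chapter`, `grace_like_score`) and the token-overlap similarity are placeholders, not the paper's definition.

```python
# Illustrative sketch of a GRACE-style chaptering metric (not the official
# implementation): each ground-truth chapter may be covered by several
# predicted chapters (many-to-one), and coverage is weighted by the semantic
# similarity of the chapter titles.

from dataclasses import dataclass

@dataclass
class Chapter:
    start: float   # seconds
    end: float     # seconds
    title: str

def temporal_overlap(a: Chapter, b: Chapter) -> float:
    """Length of the time overlap between two chapters, in seconds."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def title_similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; a real implementation would use
    sentence embeddings (e.g., cosine similarity) instead of token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def grace_like_score(preds: list[Chapter], gts: list[Chapter]) -> float:
    """Average, over ground-truth chapters, of temporal coverage weighted by
    title similarity, allowing many predictions to cover one GT chapter."""
    if not gts:
        return 0.0
    total = 0.0
    for gt in gts:
        duration = max(1e-6, gt.end - gt.start)
        covered = 0.0
        for p in preds:
            ov = temporal_overlap(p, gt)
            if ov > 0:
                covered += (ov / duration) * title_similarity(p.title, gt.title)
        total += min(1.0, covered)
    return total / len(gts)
```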
Community
ARC-Chapter is a large-scale model from Tencent ARC Lab for deep video understanding and structured chapter generation. It automatically analyzes videos that have a clear narrative or semantic structure, segmenting them into meaningful chapters, identifying precise timestamps, and generating summaries for each part.
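As a rough illustration of what such structured output might look like, the snippet below shows a hypothetical chapter record with timestamps, bilingual titles, a summary, and a nested sub-chapter. The field names and structure are illustrative only, not ARC-Chapter's actual output schema.

```python
# Hypothetical example of hierarchical, bilingual chapter output;
# all field names and values are illustrative.
chapters = [
    {
        "start": "00:00:00",
        "end": "00:12:30",
        "title_en": "Introduction and course overview",
        "title_zh": "课程介绍与概览",
        "summary": "The speaker outlines the goals of the lecture and previews the main topics.",
        "subchapters": [
            {
                "start": "00:03:10",
                "end": "00:07:45",
                "title_en": "Why long-video structuring matters",
            },
        ],
    },
    # ... subsequent chapters
]
```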
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction (2025)
- Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization (2025)
- SummDiff: Generative Modeling of Video Summarization with Diffusion (2025)
- K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding (2025)
- VC4VG: Optimizing Video Captions for Text-to-Video Generation (2025)
- MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding (2025)
- From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding (2025)