arXiv:2511.03334

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Published on Nov 5 · Submitted by wangshuai on Nov 6
#3 Paper of the day
AI-generated summary

UniAVGen, a unified framework using dual Diffusion Transformers and Asymmetric Cross-Modal Interaction, enhances audio-video generation by ensuring synchronization and consistency with fewer training samples.

Abstract

Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
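The abstract describes the interaction mechanism only at a high level (two parallel DiT branches exchanging bidirectional, temporally aligned cross-attention), so the following minimal PyTorch sketch is an illustration of that idea rather than the paper's implementation. The module names, token shapes, and the windowed alignment mask are assumptions, and the sketch omits the asymmetric design and the Face-Aware Modulation described above.

```python
# Minimal sketch: bidirectional, temporally aligned cross-attention between
# parallel audio and video token streams. Illustrative only; not the released code.
import torch
import torch.nn as nn


def aligned_mask(n_q: int, n_kv: int, window: float, device) -> torch.Tensor:
    """Boolean mask (True = blocked): each query may attend only to key tokens
    whose normalized temporal position lies within a small window of its own."""
    q_pos = torch.linspace(0, 1, n_q, device=device).unsqueeze(1)    # (n_q, 1)
    kv_pos = torch.linspace(0, 1, n_kv, device=device).unsqueeze(0)  # (1, n_kv)
    return (q_pos - kv_pos).abs() > window


class CrossModalInteraction(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, window: float = 0.1):
        super().__init__()
        self.window = window
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v, self.norm_a = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, Tv, D) latent tokens from the video DiT branch
        # audio: (B, Ta, D) latent tokens from the audio DiT branch
        Tv, Ta = video.shape[1], audio.shape[1]
        m_va = aligned_mask(Tv, Ta, self.window, video.device)
        m_av = aligned_mask(Ta, Tv, self.window, video.device)
        # Video queries attend to temporally nearby audio tokens and vice versa;
        # both directions are added back residually so each branch keeps its own stream.
        video = video + self.v_from_a(self.norm_v(video), audio, audio, attn_mask=m_va)[0]
        audio = audio + self.a_from_v(self.norm_a(audio), video, video, attn_mask=m_av)[0]
        return video, audio


if __name__ == "__main__":
    block = CrossModalInteraction()
    v, a = torch.randn(2, 32, 512), torch.randn(2, 120, 512)
    v, a = block(v, a)
    print(v.shape, a.shape)  # torch.Size([2, 32, 512]) torch.Size([2, 120, 512])
```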

Community

UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.
At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three critical innovations:
(1) Asymmetric Cross-Modal Interaction for bidirectional, temporally aligned cross-attention;
(2) Face-Aware Modulation to prioritize salient facial regions during interaction;
(3) Modality-Aware Classifier-Free Guidance to amplify cross-modal correlations during inference (an illustrative guidance sketch follows the project link below).
Project Page: https://mcg-nju.github.io/UniAVGen/
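The guidance strategy is only characterized as "explicitly amplifying cross-modal correlation signals," so the sketch below is a guess at its general shape rather than the paper's formulation: a classifier-free guidance step decomposed so the cross-modal condition receives its own guidance weight. The function name, the three-pass decomposition, and the scale values are illustrative assumptions.

```python
# Hedged sketch of a modality-aware classifier-free guidance step (not the paper's exact rule).
import torch


def modality_aware_cfg(
    eps_uncond: torch.Tensor,   # denoiser output with all conditions dropped
    eps_cond: torch.Tensor,     # output conditioned on the prompt/reference only
    eps_full: torch.Tensor,     # output conditioned on prompt + the other modality's latents
    cond_scale: float = 5.0,    # illustrative guidance weights, not tuned values
    cross_scale: float = 3.0,
) -> torch.Tensor:
    """Combine three denoiser passes so the cross-modal term gets its own weight."""
    cond_term = eps_cond - eps_uncond     # push toward the conditioning prompt
    cross_term = eps_full - eps_cond      # push toward agreement with the other modality
    return eps_uncond + cond_scale * cond_term + cross_scale * cross_term


if __name__ == "__main__":
    preds = [torch.randn(1, 16, 64) for _ in range(3)]
    out = modality_aware_cfg(*preds)
    print(out.shape)  # torch.Size([1, 16, 64])
```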

Paper author

The code and checkpoints will be released soon.

