arxiv:2508.09868

Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions

Published on Aug 13, 2025
Authors:

Abstract

We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. Across the different ASR architectures, we examine a spectrum of modeling choices, including label units, context length, and topology. To isolate language domain effects from acoustic variation, we synthesize target domain audio using a text-to-speech system trained on LibriSpeech. We incorporate target domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To our knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, rather than the decoder architecture choice or the distinction between classic modular and novel seq2seq models, it is specific modeling choices that influence performance.

AI-generated summary

Under domain shift, ASR performance is driven by specific modeling choices rather than by the decoder architecture or the distinction between classic modular and seq2seq models.
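To make the TTS-enabled separation idea concrete, here is a minimal Python sketch of the evaluation protocol: target-domain text is synthesized with a TTS model trained on LibriSpeech, so the acoustic conditions stay in-domain while only the language domain changes, and the ASR system is then scored on the synthetic audio. The `synthesize` and `transcribe` callables are hypothetical placeholders standing in for the TTS and ASR systems (the stubs at the bottom just let the sketch run end to end); only `jiwer`, a standard WER library, is a real dependency. This is an illustration of the idea, not the paper's actual pipeline.

```python
# Sketch: word error rate on synthesized target-domain audio.
# synthesize/transcribe are placeholder interfaces, not the paper's toolkit.

import jiwer  # real package for WER computation: pip install jiwer


def wer_on_synthesized_domain(texts, synthesize, transcribe):
    """Synthesize each target-domain sentence, transcribe it, and return corpus WER.

    texts       : list of target-domain reference sentences (str)
    synthesize  : callable str -> audio   (placeholder for a LibriSpeech-trained TTS)
    transcribe  : callable audio -> str   (placeholder for the ASR system under test)
    """
    hypotheses = []
    for text in texts:
        audio = synthesize(text)              # acoustics stay in the TTS training domain
        hypotheses.append(transcribe(audio))  # ASR output on the synthetic audio
    return jiwer.wer(texts, hypotheses)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; swap in real TTS/ASR models in practice.
    target_domain_texts = [
        "the committee approved the quarterly budget",
        "please schedule the meeting for next tuesday",
    ]

    def fake_synthesize(text):
        return text  # stand-in "audio": no real TTS here

    def fake_transcribe(audio):
        return audio.replace("quarterly", "quarterly the")  # injects one recognition error

    print("corpus WER on synthetic target-domain audio:",
          wer_on_synthesized_domain(target_domain_texts, fake_synthesize, fake_transcribe))
```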

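The abstract also mentions incorporating target-domain n-gram and neural language models without retraining the acoustic model. One common way to do this is to rescore the decoder's n-best list with the external LM, combining the acoustic/decoder score with a weighted LM score and a length reward. The sketch below illustrates that scoring rule with a toy add-one-smoothed bigram LM; the n-best scores, the weights `lam` and `beta`, and the tiny training text are illustrative assumptions, not values or methods from the paper.

```python
# Sketch: LM-based domain adaptation via n-best rescoring, acoustic model untouched.
# Scoring rule: combined = AM score + lam * LM log-prob + beta * hypothesis length.

import math
from collections import defaultdict


def train_bigram_lm(sentences):
    """Add-one-smoothed bigram LM estimated from target-domain text."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    vocab = set()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    v = len(vocab)

    def logprob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigram[(p, c)] + 1) / (unigram[p] + v))
            for p, c in zip(words, words[1:])
        )

    return logprob


def rescore(nbest, lm_logprob, lam=0.3, beta=0.5):
    """Pick the hypothesis maximizing AM score + lam * LM score + beta * length."""
    def combined(hyp, am_score):
        return am_score + lam * lm_logprob(hyp) + beta * len(hyp.split())

    return max(nbest, key=lambda pair: combined(*pair))[0]


if __name__ == "__main__":
    # Target-domain text is used only to train the LM; the acoustic model is untouched.
    domain_text = [
        "the committee approved the budget",
        "the committee rejected the budget",
        "the budget was approved by the committee",
    ]
    lm = train_bigram_lm(domain_text)

    # Hypothetical n-best list: (hypothesis, acoustic/decoder log score).
    nbest = [
        ("the comedy approved the budget", -11.2),    # best acoustic score, wrong word
        ("the committee approved the budget", -11.5),
    ]
    print("best hypothesis after LM rescoring:", rescore(nbest, lm))
```

In this scoring rule, `lam` trades off acoustic against language-model evidence, while `beta` counteracts the LM's tendency to prefer shorter hypotheses; in the toy example the target-domain LM overrides a slightly better acoustic score and recovers the in-domain word.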