Abstract
Drax, a discrete flow matching framework for ASR, achieves state-of-the-art recognition accuracy with improved efficiency by constructing an audio-conditioned probability path.
Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct transitions from random noise to the target. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, which are controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
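To make the idea of an audio-conditioned probability path more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a per-token three-way mixture between a source token, an intermediate token, and the target token. The quadratic weights, the `<mask>` source token, the function names, and the hand-written "intermediate" transcript are all illustrative assumptions; in practice the intermediate sequence would be drawn from some audio-conditioned distribution, and the sequences would need to be aligned to equal length.

```python
import random

MASK = "<mask>"  # hypothetical source ("noise") token, assumed for illustration


def path_weights(t):
    """Return mixture weights (source, intermediate, target) at time t in [0, 1].

    Assumed schedule: the binomial expansion of ((1 - t) + t)^2, so the
    weights always sum to 1, start fully on the source at t=0, and end
    fully on the target at t=1.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2


def sample_path_sequence(t, intermediate, target, rng):
    """Sample a noisy sequence x_t, choosing each position independently.

    `intermediate` stands in for a sample from an audio-conditioned
    distribution (an error-prone hypothesis); `target` is the reference
    transcript. Both are assumed to have the same length.
    """
    if len(intermediate) != len(target):
        raise ValueError("sequences must have equal length")
    w_src, w_mid, _ = path_weights(t)
    x_t = []
    for mid_tok, tgt_tok in zip(intermediate, target):
        u = rng.random()
        if u < w_src:
            x_t.append(MASK)      # still at the source
        elif u < w_src + w_mid:
            x_t.append(mid_tok)   # plausible intermediate error
        else:
            x_t.append(tgt_tok)   # already correct
    return x_t


if __name__ == "__main__":
    rng = random.Random(0)
    target = ["the", "cat", "sat", "down"]
    intermediate = ["the", "cap", "sat", "town"]  # made-up acoustic confusions
    for t in (0.0, 0.5, 1.0):
        print(t, sample_path_sequence(t, intermediate, target, rng))
```

Under this assumed schedule, mid-path samples mix masked positions, acoustically plausible errors, and correct tokens, so a model trained on them sees states closer to the partially wrong hypotheses it encounters during parallel decoding than it would with a direct noise-to-target path.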
Community
We propose Drax, a non-autoregressive ASR model using discrete flow matching that includes an audio-conditioned intermediate distribution to better match inference dynamics.
Drax achieves accuracy comparable to state-of-the-art autoregressive models while offering better control over the accuracy-efficiency trade-off point.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech (2025)
- UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models (2025)
- From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training (2025)
- Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens (2025)
- Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing (2025)
- Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech (2025)
- TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling (2025)