CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR
The CleanMel model was presented in the paper CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR.
CleanMel is a single-channel Mel-spectrogram denoising and dereverberation network designed to improve both speech quality and automatic speech recognition (ASR) performance. It takes noisy and reverberant microphone recordings as input and predicts the corresponding clean Mel-spectrogram. This enhanced Mel-spectrogram can then be either transformed to a speech waveform with a neural vocoder or directly used for ASR.
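For readers unfamiliar with the representation, the sketch below shows how one frame of a linear-frequency power spectrum can be projected onto Mel bands and log-compressed. It is a minimal, dependency-free illustration and not the project's feature extractor: the 16 kHz sample rate, 512-point FFT, 80 Mel bands, and the HTK-style Mel formula are all assumptions chosen for the example.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Hz -> Mel conversion (assumed here; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters, each with n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def log_mel_frame(power_frame, bank, eps=1e-10):
    """Project one power-spectrum frame onto the Mel bands and take the log."""
    return [math.log(sum(w * p for w, p in zip(row, power_frame)) + eps)
            for row in bank]

# Assumed example settings: 16 kHz audio, 512-point FFT, 80 Mel bands.
bank = mel_filterbank(n_mels=80, n_fft=512, sample_rate=16000)
flat_frame = [1.0] * (512 // 2 + 1)      # dummy power spectrum
log_mel = log_mel_frame(flat_frame, bank)
print(len(flat_frame), "->", len(log_mel))  # 257 linear bins -> 80 Mel bands
```

With these assumed settings each frame shrinks from 257 linear-frequency bins to 80 Mel bands, which is the kind of compactness the network operates on.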
The proposed network employs interleaved cross-band and narrow-band processing in the Mel-frequency domain, which allows it to learn full-band spectral patterns and narrow-band signal properties, respectively. A key advantage of Mel-spectrogram enhancement over linear-frequency-domain or time-domain speech enhancement is that the Mel-frequency representation is more compact, which makes it easier to learn. This compactness benefits both speech quality and ASR. Experimental results on five English datasets and one Chinese dataset demonstrate significant improvements in both speech quality and ASR performance.
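The difference between the two kinds of processing is mainly which axis of the spectrogram a layer looks along. The toy sketch below makes that concrete with fixed averaging filters standing in for the learned layers; the function names, the filters, and the cross-band-then-narrow-band order inside a block are assumptions for illustration, not the CleanMel architecture itself.

```python
def cross_band(spec):
    """Stand-in for a cross-band layer: mixes neighbouring Mel bands
    inside each frame, never across frames. spec is indexed spec[t][f]."""
    out = []
    for frame in spec:
        n = len(frame)
        out.append([
            (frame[max(f - 1, 0)] + frame[f] + frame[min(f + 1, n - 1)]) / 3.0
            for f in range(n)
        ])
    return out

def narrow_band(spec):
    """Stand-in for a narrow-band layer: runs the same causal smoother
    along time for every Mel band independently, never across bands."""
    n_frames, n_bands = len(spec), len(spec[0])
    out = [[0.0] * n_bands for _ in range(n_frames)]
    for f in range(n_bands):
        state = spec[0][f]
        for t in range(n_frames):
            state = 0.5 * state + 0.5 * spec[t][f]
            out[t][f] = state
    return out

def interleaved(spec, n_blocks=2):
    """Alternate the two kinds of processing, block after block."""
    for _ in range(n_blocks):
        spec = cross_band(spec)
        spec = narrow_band(spec)
    return spec

# A single impulse at frame 0, band 2 of a 4-frame, 5-band spectrogram.
impulse = [[0.0] * 5 for _ in range(4)]
impulse[0][2] = 1.0
enhanced = interleaved(impulse)
```

Applied to the impulse, `cross_band` alone spreads energy only to neighbouring bands of frame 0, while `narrow_band` alone spreads it only along time within band 2. Interleaving the two lets information reach every time-frequency cell.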
- Paper
- Project Page
- GitHub Repository
- Hugging Face Demo
Overview

CleanMel enhances log-Mel spectrograms for improved speech quality and ASR performance. Outputs are compatible with:
- Vocoders for enhanced waveforms
- ASR systems for transcription
Quick Start
Environment Setup
conda create -n CleanMel python=3.10.14
conda activate CleanMel
pip install -r requirements.txt
Inference
Pretrained models can be downloaded manually from the WestlakeAudioLab/CleanMel repository, or automatically with the help of the huggingface-hub package.
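For the manual route, one option is to fetch the files programmatically with `snapshot_download` from the huggingface-hub package. The sketch below assumes that `WestlakeAudioLab/CleanMel` is the repository id on the Hugging Face Hub and uses a hypothetical `./pretrained` target folder; where the inference scripts expect local checkpoints is not stated here, so check the repository before relying on this layout.

```python
def checkpoint_name(mode, size, output):
    """Compose a name following the pattern used in this README,
    e.g. offline_CleanMel_S_mask or online_CleanMel_S_map."""
    return f"{mode}_CleanMel_{size}_{output}"

def download_cleanmel(local_dir="./pretrained",
                      repo_id="WestlakeAudioLab/CleanMel"):
    """Download a snapshot of the model repository and return its local path.

    local_dir is a hypothetical location; repo_id is assumed from the
    repository name mentioned in this README.
    """
    # Imported lazily so checkpoint_name works without huggingface-hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Requires network access and `pip install huggingface-hub`:
# local_path = download_cleanmel()
# print("Files stored under:", local_path)
```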
# Inference with pretrained models from huggingface
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask huggingface
## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map huggingface
# Inference with local pretrained models
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask
## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map
Custom Input: Modify speech_folder in inference.sh
Output: Results are saved to output_folder (defaults to ./my_output)
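When several configurations need to be run in sequence, the command lines above can be assembled in a small script. The helper below is a hypothetical convenience, not part of the repository. It assumes the positional arguments of inference.sh are, in order, the device string, the mode (offline or online), the model size, the output type (mask or map), and an optional `huggingface` flag, which is inferred from the examples above rather than from the script itself.

```python
import shlex

def inference_command(gpu="0,", mode="offline", size="S", output="mask",
                      from_hub=False):
    """Build the argument list for shell/inference.sh (argument order assumed
    from the README examples)."""
    args = ["bash", "inference.sh", gpu, mode, size, output]
    if from_hub:
        args.append("huggingface")
    return args

# Print the two Hugging Face examples from this README.
for mode, output in [("offline", "mask"), ("online", "map")]:
    print(shlex.join(inference_command(mode=mode, output=output, from_hub=True)))
```

Each list can be passed to `subprocess.run(cmd, cwd="shell", check=True)` to launch the run from the repository root.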
Performance
Speech Enhancement


ASR Accuracy

ASR implementation details are available in the asr_infer branch of the GitHub repository.
Citation
If you find CleanMel useful, please cite our work:
@ARTICLE{11097896,
author={Shao, Nian and Zhou, Rui and Wang, Pengyu and Li, Xian and Fang, Ying and Yang, Yujie and Li, Xiaofei},
journal={IEEE Transactions on Audio, Speech and Language Processing},
title={CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR},
year={2025},
volume={},
number={},
pages={1-13},
doi={10.1109/TASLPRO.2025.3592333}
}