CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

The CleanMel model was presented in the paper CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR.

CleanMel is a single-channel Mel-spectrogram denoising and dereverberation network designed to improve both speech quality and automatic speech recognition (ASR) performance. It takes noisy and reverberant microphone recordings as input and predicts the corresponding clean Mel-spectrogram. This enhanced Mel-spectrogram can then be either transformed to a speech waveform with a neural vocoder or directly used for ASR.

The proposed network interleaves cross-band and narrow-band processing in the Mel-frequency domain, learning full-band spectral patterns and narrow-band signal properties, respectively. Compared with speech enhancement in the linear-frequency domain or the time domain, a key advantage of Mel-spectrogram enhancement is that the Mel-frequency scale represents speech more compactly, which makes it easier to learn. This compactness benefits both speech quality and ASR. Experiments on five English datasets and one Chinese dataset demonstrate significant improvements in both speech quality and ASR performance.
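To make the compactness argument concrete, here is a minimal sketch of a triangular Mel filterbank in plain Python. It is not CleanMel's front end: the sample rate (16 kHz), FFT size (512), number of Mel bands (80), the HTK-style Mel formula, and all function names are assumptions chosen for illustration. Under these settings, the 257 linear-frequency bins of each frame are summarized by 80 Mel bands.

```python
import math


def hz_to_mel(hz: float) -> float:
    """HTK-style Hz-to-Mel conversion (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate: int = 16000, n_fft: int = 512, n_mels: int = 80):
    """Return n_mels triangular filters, each a list of n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    bin_freqs = [k * sample_rate / n_fft for k in range(n_bins)]
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(i * mel_max / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        # Triangle rising from lo to center, then falling from center to hi.
        row = [
            max(0.0, min((f - lo) / (center - lo), (hi - f) / (hi - center)))
            for f in bin_freqs
        ]
        bank.append(row)
    return bank


def log_mel_frame(power_frame, bank, eps: float = 1e-10):
    """Map one frame of linear-frequency power values to log-Mel energies."""
    return [
        math.log(sum(w * p for w, p in zip(row, power_frame)) + eps)
        for row in bank
    ]


if __name__ == "__main__":
    bank = mel_filterbank()
    print(f"{len(bank[0])} linear bins -> {len(bank)} Mel bands")
```

Each Mel band pools several neighbouring linear-frequency bins at high frequencies, so the network sees a shorter frequency axis per frame than it would in the linear-frequency domain.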

Overview 🚀

Figure: CleanMel architecture

CleanMel enhances log-Mel spectrograms to improve speech quality and ASR performance. Its outputs are compatible with:

🎙️ Vocoders for enhanced waveforms
🤖 ASR systems for transcription

Quick Start ⚡

Environment Setup

conda create -n CleanMel python=3.10.14
conda activate CleanMel
pip install -r requirements.txt

Inference

Pretrained models can be downloaded manually from the WestlakeAudioLab/CleanMel repository, or automatically with the help of the huggingface-hub package.
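For the automatic route, the following is a minimal sketch that fetches the repository with `snapshot_download` from the huggingface-hub package. The helper name `download_cleanmel` and the target folder `./pretrained` are hypothetical, and the repository's own scripts may handle downloading differently.

```python
from pathlib import Path

REPO_ID = "WestlakeAudioLab/CleanMel"  # repository named in this README


def download_cleanmel(local_dir: str = "./pretrained") -> Path:
    """Download a snapshot of the model repository and return its local path.

    `local_dir` is a hypothetical default, not a path the project prescribes.
    """
    # Imported lazily so this module still loads without huggingface-hub.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


# Usage (requires network access and `pip install huggingface-hub`):
# checkpoint_dir = download_cleanmel()
```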

# Inference with pretrained models from huggingface
## Offline example (offline_CleanMel_S_mask)
cd shell
bash inference.sh 0, offline S mask huggingface

## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map huggingface

# Inference with local pretrained models
## Offline example (offline_CleanMel_S_mask)
cd shell  # skip if you are already in the shell directory
bash inference.sh 0, offline S mask

## Online example (online_CleanMel_S_map)
bash inference.sh 0, online S map

Custom input: Set speech_folder in inference.sh to the folder containing your own recordings.

Output: Results are saved to output_folder (defaults to ./my_output).
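The two model names that appear in the inference comments (offline_CleanMel_S_mask and online_CleanMel_S_map) suggest a naming pattern of mode, model size, and output type. The sketch below encodes that pattern. It is inferred from those two examples only, the helper `model_tag` is hypothetical, and it accepts only the mode and output values that appear in this README.

```python
VALID_MODES = ("offline", "online")  # values seen in this README
VALID_OUTPUTS = ("mask", "map")      # values seen in this README


def model_tag(mode: str, size: str, output: str) -> str:
    """Build a model name such as 'offline_CleanMel_S_mask'.

    The pattern is an assumption inferred from two example names.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"output must be one of {VALID_OUTPUTS}, got {output!r}")
    return f"{mode}_CleanMel_{size}_{output}"


if __name__ == "__main__":
    print(model_tag("offline", "S", "mask"))
```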

Performance 📊

Speech Enhancement

Figure: DNSMOS performance

Figure: PESQ performance

ASR Accuracy

Figure: ASR performance

💡 ASR implementation details are available in the asr_infer branch of the GitHub repository.

Citation 📝

If you find CleanMel useful, please cite our work:

@ARTICLE{11097896,
  author={Shao, Nian and Zhou, Rui and Wang, Pengyu and Li, Xian and Fang, Ying and Yang, Yujie and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing}, 
  title={CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR}, 
  year={2025},
  volume={},
  number={},
  pages={1-13},
  doi={10.1109/TASLPRO.2025.3592333}
}

Acknowledgement 🙏

Built using NBSS template
Vocoder implementation from Vocos
