Whisper-Large v3 for Mandarin Dialect and Cantonese Classification
Model Description
This model includes the implementation of Mandarin dialect and Cantonese classification described in Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe.
Github repository: https://github.com/tiantiaf0627/voxlect
The included Mandarin dialects and Cantonese are:
[
"Jiang-Huai",
"Jiao-Liao",
"Ji-Lu",
"Lan-Yin",
"Mandarin",
"Southwestern",
"Zhongyuan",
"Cantonese"
]
Northeastern and Beijing Mandarin have been merged into Mandarin due to their high degree of similarity and the limited number of speakers.
How to use this model
Download repo
git clone [email protected]:tiantiaf0627/voxlect
Install the package
conda create -n voxlect python=3.8
cd voxlect
pip install -e .
Load the model
# Load libraries
import torch
import torch.nn.functional as F
from src.model.dialect.whisper_dialect import WhisperWrapper
# Find device
device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/voxlect-mandarin-cantonese-dialect-whisper-large-v3").to(device)
model.eval()
Prediction
# Label List
dialect_list = [
"Jiang-Huai",
"Jiao-Liao",
"Ji-Lu",
"Lan-Yin",
"Mandarin",
"Southwestern",
"Zhongyuan",
"Cantonese"
]
# Load data, here just zeros as an example
# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computational constraints)
# So prepare your audio as mono-channel 16 kHz, with a maximum length of 15 seconds (see the sketch after this block for loading a real file)
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embeddings = model(data, return_feature=True)
# Probability and output
dialect_prob = F.softmax(logits, dim=1)
print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
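As a more realistic usage sketch, the snippet below loads an audio file with torchaudio, converts it to mono 16 kHz, truncates it to 15 seconds, and prints the probability of every dialect label. The file path "example.wav" and the torchaudio-based resampling are illustrative assumptions, not part of the original example; any equivalent preprocessing should work.
# A minimal sketch, assuming a local file "example.wav" (hypothetical path)
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")  # shape: [channels, samples]

# Convert to mono by averaging channels if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if needed
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

# Truncate to the maximum supported length (15 seconds at 16 kHz)
waveform = waveform[:, :max_audio_length].float().to(device)

with torch.no_grad():
    logits, embeddings = model(waveform, return_feature=True)
dialect_prob = F.softmax(logits, dim=1)

# Print the full probability distribution, one line per dialect
for idx, dialect in enumerate(dialect_list):
    print(f"Dialect: {dialect} Probability: {dialect_prob[0, idx].item():.3f}")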
For example, for Sichuan dialect (Southwestern Mandarin) speech generated by CosyVoice2, Voxlect produces the following probabilities:
Dialect: Jiang-Huai Probability: 0.001
Dialect: Jiao-Liao Probability: 0.001
Dialect: Ji-Lu Probability: 0.009
Dialect: Lan-Yin Probability: 0.000
Dialect: Mandarin Probability: 0.002
Dialect: Southwestern Probability: 0.981 (Target dialect)
Dialect: Zhongyuan Probability: 0.006
Dialect: Cantonese Probability: 0.000
As another example, for Tianjin dialect speech generated by CosyVoice2, Voxlect produces the following probabilities:
Dialect: Jiang-Huai Probability: 0.002
Dialect: Jiao-Liao Probability: 0.051
Dialect: Ji-Lu Probability: 0.169 (Target dialect)
Dialect: Lan-Yin Probability: 0.001
Dialect: Mandarin Probability: 0.765
Dialect: Southwestern Probability: 0.000
Dialect: Zhongyuan Probability: 0.013
Dialect: Cantonese Probability: 0.000
If you are a Mandarin speaker, you will notice that the first two words sound like Tianjin dialect, but the rest of the generated speech sounds closer to standard Mandarin; our model captures this shift correctly.
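To see where within an utterance the predicted dialect changes, as in the Tianjin example above, one option is to classify fixed-length segments separately. This is only a hedged sketch reusing waveform, model, and dialect_list from the snippets above; the 3-second window matches the training minimum but is otherwise an illustrative choice.
# Segment-wise classification to inspect within-utterance dialect changes (illustrative sketch)
segment_length = 3 * 16000  # 3-second windows at 16 kHz
for start in range(0, waveform.shape[1], segment_length):
    segment = waveform[:, start:start + segment_length]
    if segment.shape[1] < segment_length:
        break  # skip trailing segments shorter than the 3-second training minimum
    with torch.no_grad():
        seg_logits, _ = model(segment, return_feature=True)
    seg_prob = F.softmax(seg_logits, dim=1)
    top_idx = torch.argmax(seg_prob, dim=1).item()
    print(f"{start / 16000:.1f}-{(start + segment_length) / 16000:.1f}s: {dialect_list[top_idx]}")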
Responsible Use: Users should respect the privacy and consent of the data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.
If you have any questions, please contact: Tiantian Feng ([email protected])
❌ Out-of-Scope Use
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- Commercial use
If you like our work or use the models in your work, kindly cite the following. We appreciate your recognition!
@article{feng2025voxlect,
title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
journal={arXiv preprint arXiv:2508.01691},
year={2025}
}
Base model: openai/whisper-large-v3