---
base_model:
- openai/whisper-large-v3
datasets:
- mozilla-foundation/common_voice_11_0
language:
- zh
license: cc-by-nc-4.0
metrics:
- accuracy
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speaker_dialect_classification
library_name: transformers
---

# Whisper-Large v3 for Mandarin Dialect and Cantonese Classification

# Model Description

This model implements the Mandarin dialect and Cantonese classification described in **Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe**.

GitHub repository: https://github.com/tiantiaf0627/voxlect

The covered Mandarin dialects and Cantonese are:
```
[
    "Jiang-Huai",
    "Jiao-Liao",
    "Ji-Lu",
    "Lan-Yin",
    "Mandarin",
    "Southwestern",
    "Zhongyuan",
    "Cantonese"
]
```
Northeastern and Beijing Mandarin have been merged into Mandarin due to their high degree of similarity and the limited number of speakers.

# How to use this model

## Download repo
```bash
git clone git@github.com:tiantiaf0627/voxlect
```

## Install the package
```bash
conda create -n voxlect python=3.8
conda activate voxlect
cd voxlect
pip install -e .
```

## Load the model
```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.dialect.whisper_dialect import WhisperWrapper

# Select device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/voxlect-mandarin-cantonese-dialect-whisper-large-v3").to(device)
model.eval()
```

## Prediction
```python
# Label list
dialect_list = [
    "Jiang-Huai",
    "Jiao-Liao",
    "Ji-Lu",
    "Lan-Yin",
    "Mandarin",
    "Southwestern",
    "Zhongyuan",
    "Cantonese"
]

# Load data; zeros are used here only as a placeholder.
# Our training data filters out audio shorter than 3 seconds (unreliable predictions)
# and longer than 15 seconds (computation limitation), so prepare your audio as
# mono-channel 16 kHz clips of at most 15 seconds.
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embeddings = model(data, return_feature=True)

# Probability and predicted dialect
dialect_prob = F.softmax(logits, dim=1)
print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
```

### Example: Sichuan dialect speech generated by CosyVoice2

[Download Example](https://huggingface.co/tiantiaf/voxlect-mandarin-cantonese-dialect-whisper-large-v3/tree/main/Sichuan_1000238_4_0.wav)

Using Voxlect, we obtain the following probabilities:
```
Dialect: Jiang-Huai    Probability: 0.001
Dialect: Jiao-Liao     Probability: 0.001
Dialect: Ji-Lu         Probability: 0.009
Dialect: Lan-Yin       Probability: 0.000
Dialect: Mandarin      Probability: 0.002
Dialect: Southwestern  Probability: 0.981 (Target dialect)
Dialect: Zhongyuan     Probability: 0.006
Dialect: Cantonese     Probability: 0.000
```

### Example: Tianjin dialect speech generated by CosyVoice2

[Download Example](https://huggingface.co/tiantiaf/voxlect-mandarin-cantonese-dialect-whisper-large-v3/tree/main/Tianjin_1002906_0_0.wav)

Using Voxlect, we obtain the following probabilities:
```
Dialect: Jiang-Huai    Probability: 0.002
Dialect: Jiao-Liao     Probability: 0.051
Dialect: Ji-Lu         Probability: 0.169 (Target dialect)
Dialect: Lan-Yin       Probability: 0.001
Dialect: Mandarin      Probability: 0.765
Dialect: Southwestern  Probability: 0.000
Dialect: Zhongyuan     Probability: 0.013
Dialect: Cantonese     Probability: 0.000
```
If you are a Mandarin speaker, you will notice that the first two words sound like Tianjin dialect, while the rest of the generated speech sounds closer to standard Mandarin; our model captures this mix correctly.
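To reproduce probability tables like the ones above on your own recordings, the minimal sketch below (not part of the original repo) loads a wav file, converts it to mono 16 kHz, and prints the full distribution. It assumes the `model`, `device`, and `dialect_list` defined earlier; the file name `example.wav` is a hypothetical placeholder.

```python
# Minimal sketch: score a real recording and print all dialect probabilities.
# Assumes model, device, and dialect_list from the sections above;
# "example.wav" is a hypothetical placeholder file name.
import torch
import torch.nn.functional as F
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")
# Downmix to mono if the recording has multiple channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
# Resample to the expected 16 kHz
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
# Truncate to the 15-second maximum noted above
waveform = waveform[:, :15 * 16000].float().to(device)

with torch.no_grad():
    logits, embeddings = model(waveform, return_feature=True)
dialect_prob = F.softmax(logits, dim=1).squeeze(0)

for dialect, prob in zip(dialect_list, dialect_prob):
    print(f"Dialect: {dialect:<13s} Probability: {prob.item():.3f}")
```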
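For recordings longer than 15 seconds, one simple option (again a sketch of ours, not a method from the repo) is to score non-overlapping 15-second windows and average the probabilities. Here `waveform` is assumed to be a mono 16 kHz tensor of shape `(1, T)` already on `device`.

```python
# Sketch: average window-level probabilities over a long recording.
# Assumes model, device, and dialect_list from above; trailing chunks
# under 3 seconds are skipped, as the card flags them as unreliable.
import torch
import torch.nn.functional as F

def predict_long_audio(waveform, window=15 * 16000, min_len=3 * 16000):
    probs = []
    for start in range(0, waveform.shape[1], window):
        chunk = waveform[:, start:start + window]
        if chunk.shape[1] < min_len:
            continue  # too short for a reliable prediction
        with torch.no_grad():
            logits, _ = model(chunk, return_feature=True)
        probs.append(F.softmax(logits, dim=1))
    if not probs:
        raise ValueError("Recording shorter than 3 seconds")
    return torch.cat(probs, dim=0).mean(dim=0)

avg_prob = predict_long_audio(waveform)
print(dialect_list[avg_prob.argmax().item()])
```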
### Responsible Use
Users should respect the privacy and consent of data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.

## If you have any questions, please contact: Tiantian Feng (tiantiaf@usc.edu)

❌ **Out-of-Scope Use**
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- Commercial use (the model is released under CC BY-NC 4.0)

#### If you like our work or use the models in your work, kindly cite the following. We appreciate your recognition!
```
@article{feng2025voxlect,
  title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
  author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
  journal={arXiv preprint arXiv:2508.01691},
  year={2025}
}
```