---
license: apache-2.0
tags:
- speaker-verification
- speaker-embedding
- cross-lingual
- multilingual
- wespeaker
- resnet
- pytorch
datasets:
- voxblink2
- voxceleb2
- tidyvoicex
metrics:
- eer
- mindcf
---

# TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model

## Model Description

This is the baseline model for the **TidyVoice Challenge: Cross-Lingual Speaker Verification** at Interspeech 2026. The model addresses the critical problem of speaker verification under language mismatch, where performance degrades significantly when the enrollment and test utterances are spoken in different languages.

### Architecture

- **Model**: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
- **Embedding Dimension**: 256
- **Input**: 80-dimensional log Mel-filterbank features
- **Sample Rate**: 16 kHz

### Training

The model is trained in two stages:

1. **Pretrained** on the VoxBlink2 and VoxCeleb2 datasets
2. **Fine-tuned** on the TidyVoiceX training set using large-margin training

### Performance

The baseline achieves the following performance on the TidyVoice development set:

| Architecture | Pretraining Data | Fine-tuning Data | EER (%) | MinDCF |
|:---------------|:----------------------|:-----------------|:-------:|:------:|
| SimAM-ResNet34 | VoxBlink2 + VoxCeleb2 | TidyVoiceX Train | 3.07 | 0.82 |

## Usage

> **For the TidyVoice2026 Challenge**: If you are using this model for the TidyVoice2026 Challenge, please follow the detailed instructions in the [GitHub repository README](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) for complete setup, data preparation, training, and evaluation procedures.

## Additional Resources

- **TidyVoice2026 Challenge README**: [Complete setup and usage guide](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) - Follow this for detailed instructions on using this model for the TidyVoice2026 Challenge
- **GitHub Repository**: [WeSpeaker TidyVoice Baseline](https://github.com/areffarhadi/wespeaker/tree/master/examples/tidyvocie)
- **Challenge Website**: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)

### Installation

First, install WeSpeaker:

```bash
pip install git+https://github.com/wenet-e2e/wespeaker.git
```

Or clone the repository:

```bash
git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .
```
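The Quick Start below assumes `avg_model.pt` and `config.yaml` have already been downloaded to a local directory. If the files are fetched from the Hugging Face Hub, `huggingface_hub` can do this in one call; the snippet below is a minimal sketch, and the `repo_id` is a placeholder to replace with this model's actual Hub repository:

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# NOTE: the repo_id below is a placeholder, not the actual repository name.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="<org>/<tidyvoice2026-baseline>")
# model_dir now contains avg_model.pt and config.yaml and can be passed
# to wespeaker.load_model() as shown in the Quick Start below.
print(model_dir)
```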
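As background on the metrics in the performance table above: EER is the operating point at which the false-acceptance and false-rejection rates are equal, while MinDCF additionally weights misses and false alarms by application-dependent costs and target priors. The following numpy sketch illustrates one way to compute EER from trial scores and labels; it is illustrative only, not the official challenge scoring tool:

```python
# Illustrative EER computation from raw trial scores (numpy only).
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate from trial scores.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    order = np.argsort(scores)             # sweep thresholds in score order
    labels = labels[order].astype(float)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Rejecting everything at or below threshold index i:
    frr = np.cumsum(labels) / n_target                  # targets rejected
    far = 1.0 - np.cumsum(1.0 - labels) / n_nontarget   # non-targets accepted
    i = np.argmin(np.abs(far - frr))                    # point where FAR ~= FRR
    return float((far[i] + frr[i]) / 2.0)

# Example: compute_eer(np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0]))
# returns 0.0 because the target and non-target scores are perfectly separable.
```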
### Quick Start

#### Using the WeSpeaker Python API

```python
import wespeaker

# Load the model from Hugging Face.
# Download the model files (avg_model.pt and config.yaml) to a directory.
model_dir = "path/to/downloaded/model"

# Initialize the model
model = wespeaker.load_model(model_dir)
model.set_device('cuda:0')  # or 'cpu'

# Extract a speaker embedding from a single audio file
embedding = model.extract_embedding('audio.wav')
print(f"Embedding shape: {embedding.shape}")

# Compute the similarity between two audio files
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
print(f"Similarity score: {similarity}")

# Extract embeddings from multiple files (Kaldi format)
utt_names, embeddings = model.extract_embedding_list('wav.scp')
```

#### Using the Command Line

```bash
# Extract an embedding from a single audio file
wespeaker --task embedding \
    --audio_file audio.wav \
    --output_file embedding.txt \
    --pretrain path/to/model/directory

# Extract embeddings from wav.scp (Kaldi format)
wespeaker --task embedding_kaldi \
    --wav_scp wav.scp \
    --output_file embeddings.ark \
    --pretrain path/to/model/directory

# Compute the similarity between two audio files
wespeaker --task similarity \
    --audio_file audio1.wav \
    --audio_file2 audio2.wav \
    --pretrain path/to/model/directory
```

#### Using WeSpeaker Training Scripts

If you are using the WeSpeaker training framework, you can load the model checkpoint directly:

```python
import torch
import yaml

from wespeaker.models.speaker_model import get_speaker_model
from wespeaker.utils.checkpoint import load_checkpoint

# Load the config
with open('config.yaml', 'r') as f:
    configs = yaml.safe_load(f)

# Initialize the model
model = get_speaker_model(configs['model'])(**configs['model_args'])

# Load the checkpoint
load_checkpoint(model, 'avg_model.pt')

# Set to evaluation mode
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Extract embeddings (see examples/tidyvocie/README.md for the full pipeline)
```

### Model Files

The model repository contains:

- `avg_model.pt`: The averaged model checkpoint (PyTorch format)
- `config.yaml`: Model configuration file

**Note**: When using WeSpeaker's `load_model()` function, ensure the model directory contains both the `avg_model.pt` and `config.yaml` files.

## Dataset

This model is trained and evaluated on:

- **TidyVoiceX**: A large-scale multilingual corpus derived from Mozilla Common Voice
  - 4,474 speakers across 40 languages
  - 321,711 utterances totaling approximately 457 hours
  - Designed to isolate the effect of language switching

For more information about the dataset and challenge, visit [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io).

## Citation