TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model
Model Description
This is the baseline model for the TidyVoice Challenge: Cross-Lingual Speaker Verification at Interspeech 2026. It targets speaker verification under language mismatch, where system performance degrades significantly when the utterances being compared are spoken in different languages.
Architecture
- Model: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
- Embedding Dimension: 256
- Input: 80-dimensional log Mel-filterbank features
- Sample Rate: 16 kHz
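To make the input specification concrete, the sketch below estimates the shape of the feature matrix produced for an utterance of a given duration. The sample rate and the 80 Mel bins come from the list above; the 25 ms window and 10 ms shift are assumptions (common defaults for filterbank features) and are not stated in this model card, so check `config.yaml` for the values actually used.

```python
SAMPLE_RATE = 16_000    # Hz, from the model card
NUM_MEL_BINS = 80       # from the model card
FRAME_LENGTH_MS = 25    # assumption: not stated in the model card
FRAME_SHIFT_MS = 10     # assumption: not stated in the model card


def fbank_shape(duration_s: float) -> tuple[int, int]:
    """Return (num_frames, num_mel_bins) for an utterance of this duration.

    Frames are counted without padding: only windows that fit entirely
    inside the signal are kept.
    """
    num_samples = int(round(duration_s * SAMPLE_RATE))
    frame_length = SAMPLE_RATE * FRAME_LENGTH_MS // 1000  # 400 samples
    frame_shift = SAMPLE_RATE * FRAME_SHIFT_MS // 1000    # 160 samples
    if num_samples < frame_length:
        return (0, NUM_MEL_BINS)
    num_frames = 1 + (num_samples - frame_length) // frame_shift
    return (num_frames, NUM_MEL_BINS)


if __name__ == "__main__":
    # A 3-second utterance under the assumed framing parameters.
    print(fbank_shape(3.0))
```

Under these assumptions a 3-second utterance yields a 298 x 80 feature matrix, which the model then pools into a single 256-dimensional embedding.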
Training
The model is:
- Pretrained on VoxBlink2 and VoxCeleb2 datasets
- Fine-tuned on the TidyVoiceX training set using large-margin training
Performance
The baseline achieves the following performance on the TidyVoice development set:
| Architecture | Pretraining Data | Fine-tuning Data | EER (%) | MinDCF |
|---|---|---|---|---|
| SimAM-ResNet34 | VoxBlink2 + VoxCeleb2 | TidyVoiceX Train | 3.07 | 0.82 |
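For readers unfamiliar with the EER column: the equal error rate is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a minimal, dependency-free illustration of that definition on a handful of made-up trial scores. It is not the challenge's official scoring script, and the function name `compute_eer` is hypothetical.

```python
from typing import Sequence


def compute_eer(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the equal error rate as a fraction in [0, 1].

    scores: higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    num_target = sum(labels)
    num_nontarget = len(labels) - num_target
    if num_target == 0 or num_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    trials = sorted(zip(scores, labels))
    false_rejects = 0              # targets scored below the threshold
    false_accepts = num_nontarget  # non-targets scored at or above it
    # Threshold below every score: everything is accepted.
    best_gap, eer = 1.0, 0.5

    i, n = 0, len(trials)
    while i < n:
        j = i
        # Raise the threshold past all trials that share this score.
        while j < n and trials[j][0] == trials[i][0]:
            if trials[j][1] == 1:
                false_rejects += 1
            else:
                false_accepts -= 1
            j += 1
        frr = false_rejects / num_target
        far = false_accepts / num_nontarget
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
        i = j
    return eer


if __name__ == "__main__":
    # Made-up scores for illustration only.
    demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    demo_labels = [1, 1, 1, 0, 1, 0, 0, 0]
    print(f"EER: {100 * compute_eer(demo_scores, demo_labels):.2f}%")
```

Averaging the two error rates at the closest crossing is one simple convention; scoring toolkits often interpolate between thresholds instead, so results on small trial lists can differ slightly.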
Usage
For TidyVoice2026 Challenge: If you are using this model for the TidyVoice2026 Challenge, please follow the detailed instructions in the GitHub repository README for complete setup, data preparation, training, and evaluation procedures.
Additional Resources
- TidyVoice2026 Challenge README: Complete setup and usage guide for using this model in the TidyVoice2026 Challenge
- GitHub Repository: WeSpeaker TidyVoice Baseline
- Challenge Website: https://tidyvoice2026.github.io
Installation
First, install WeSpeaker:
pip install git+https://github.com/wenet-e2e/wespeaker.git
Or clone the repository:
git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .
Quick Start
Using WeSpeaker Python API
import wespeaker
import torch
# Load the model from Hugging Face
# Download the model files (avg_model.pt and config.yaml) to a directory
model_dir = "path/to/downloaded/model"
# Initialize the model
model = wespeaker.load_model(model_dir)
model.set_device('cuda:0') # or 'cpu'
# Extract speaker embedding from a single audio file
embedding = model.extract_embedding('audio.wav')
print(f"Embedding shape: {embedding.shape}")
# Compute similarity between two audio files
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
print(f"Similarity score: {similarity}")
# Extract embeddings from multiple files (Kaldi format)
utt_names, embeddings = model.extract_embedding_list('wav.scp')
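The similarity score above compares two speaker embeddings. A common choice for that comparison is cosine similarity, and the sketch below shows the computation on plain Python lists. It is an illustration only and is not taken from WeSpeaker: `compute_similarity` may scale or normalize its score differently, so treat the helper name and its output range as assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real embeddings from this model have 256 dimensions.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

If the extracted embedding is a tensor or array, converting it with `.tolist()` first should make it usable with this helper.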
Using Command Line
# Extract embedding from a single audio file
wespeaker --task embedding \
--audio_file audio.wav \
--output_file embedding.txt \
--pretrain path/to/model/directory
# Extract embeddings from wav.scp (Kaldi format)
wespeaker --task embedding_kaldi \
--wav_scp wav.scp \
--output_file embeddings.ark \
--pretrain path/to/model/directory
# Compute similarity between two audio files
wespeaker --task similarity \
--audio_file audio1.wav \
--audio_file2 audio2.wav \
--pretrain path/to/model/directory
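The `wav.scp` file used above follows the Kaldi convention of one utterance per line, with an utterance ID followed by the audio path. The sketch below writes such a file from a list of paths. The helper name `write_wav_scp` and the choice of the file stem as utterance ID are assumptions made for illustration; use whatever ID scheme your trial lists expect.

```python
from pathlib import Path
from typing import Iterable


def write_wav_scp(wav_paths: Iterable[str], scp_path: str) -> int:
    """Write '<utt_id> <path>' lines to scp_path and return the line count."""
    lines = []
    seen_ids = set()
    for wav in sorted(wav_paths):
        # Hypothetical ID scheme: file name without its extension.
        utt_id = Path(wav).stem
        if utt_id in seen_ids:
            raise ValueError(f"duplicate utterance ID: {utt_id}")
        seen_ids.add(utt_id)
        lines.append(f"{utt_id} {wav}\n")
    Path(scp_path).write_text("".join(lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    count = write_wav_scp(["audio1.wav", "audio2.wav"], "wav.scp")
    print(f"wrote {count} entries to wav.scp")
```

Sorting the paths keeps the file deterministic, and the duplicate check avoids silently overwriting one utterance's embedding with another's when two files share a name.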
Using WeSpeaker Training Scripts
If you're using the WeSpeaker training framework, you can load the model checkpoint directly:
import torch
import yaml
from wespeaker.models.speaker_model import get_speaker_model
from wespeaker.utils.checkpoint import load_checkpoint
# Load config
with open('config.yaml', 'r') as f:
    configs = yaml.safe_load(f)
# Initialize model
model = get_speaker_model(configs['model'])(**configs['model_args'])
# Load checkpoint
load_checkpoint(model, 'avg_model.pt')
# Set to evaluation mode
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# Extract embeddings (see examples/tidyvocie/README.md for full pipeline)
Model Files
The model repository should contain:
- avg_model.pt: The averaged model checkpoint (PyTorch format)
- config.yaml: Model configuration file
Note: When using WeSpeaker's load_model() function, ensure the model directory contains both avg_model.pt and config.yaml files.
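A quick pre-flight check can catch an incomplete download before the model is loaded. The sketch below only looks for the two file names listed above; the helper name `missing_model_files` is hypothetical and not part of WeSpeaker.

```python
from pathlib import Path

# File names taken from the model card.
REQUIRED_FILES = ("avg_model.pt", "config.yaml")


def missing_model_files(model_dir: str) -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("path/to/downloaded/model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Model directory looks complete.")
```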
Dataset
This model is trained and evaluated on:
- TidyVoiceX: A large-scale, multilingual corpus derived from Mozilla Common Voice
  - Over 4,474 speakers across 40 languages
  - Approximately 321,711 utterances totaling 457 hours
  - Designed to isolate the effect of language switching
For more information about the dataset and challenge, visit: https://tidyvoice2026.github.io
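The corpus figures above also give a rough sense of scale per utterance and per speaker. The sketch below is simple arithmetic on the quoted numbers; because the source describes them as approximate ("over", "approximately"), the derived averages are estimates only.

```python
# Figures quoted in the dataset description above.
NUM_SPEAKERS = 4_474
NUM_UTTERANCES = 321_711
TOTAL_HOURS = 457

# Average utterance length in seconds.
avg_utt_seconds = TOTAL_HOURS * 3600 / NUM_UTTERANCES
# Average number of utterances per speaker.
avg_utts_per_speaker = NUM_UTTERANCES / NUM_SPEAKERS
# Average amount of audio per speaker in minutes.
avg_minutes_per_speaker = TOTAL_HOURS * 60 / NUM_SPEAKERS

if __name__ == "__main__":
    print(f"Average utterance length: {avg_utt_seconds:.2f} s")
    print(f"Utterances per speaker:   {avg_utts_per_speaker:.1f}")
    print(f"Audio per speaker:        {avg_minutes_per_speaker:.2f} min")
```

This works out to roughly 5.1 seconds per utterance and about 72 utterances (around 6 minutes of audio) per speaker on average; the actual distribution across speakers and languages is not given here.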
Citation