TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model

Model Description

This is the baseline model for the TidyVoice Challenge: Cross-Lingual Speaker Verification at Interspeech 2026. The model targets speaker verification under language mismatch, where system performance degrades significantly when speakers use different languages.

Architecture

Model: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
Embedding Dimension: 256
Input: 80-dimensional log Mel-filterbank features
Sample Rate: 16 kHz
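
To make the input shape concrete, the sketch below estimates how many feature frames a clip yields at 16 kHz with 80 Mel bins. The 25 ms window and 10 ms shift are assumptions (common Kaldi-style defaults), not values stated in this model card; check config.yaml for the settings actually used. The function name fbank_shape is hypothetical.

```python
from typing import Tuple

SAMPLE_RATE = 16_000    # from the model card
NUM_MEL_BINS = 80       # from the model card
FRAME_LENGTH_MS = 25    # assumption: not stated in the model card
FRAME_SHIFT_MS = 10     # assumption: not stated in the model card


def fbank_shape(duration_s: float) -> Tuple[int, int]:
    """Return (num_frames, num_mel_bins) for a clip of the given duration."""
    num_samples = int(round(duration_s * SAMPLE_RATE))
    frame_length = SAMPLE_RATE * FRAME_LENGTH_MS // 1000  # 400 samples
    frame_shift = SAMPLE_RATE * FRAME_SHIFT_MS // 1000    # 160 samples
    if num_samples < frame_length:
        # Clip is shorter than a single analysis window.
        return 0, NUM_MEL_BINS
    # Count only frames that fit entirely inside the signal (no edge padding).
    num_frames = 1 + (num_samples - frame_length) // frame_shift
    return num_frames, NUM_MEL_BINS


print(fbank_shape(3.0))  # (298, 80)
```

Under these assumptions, a 3-second utterance becomes a 298 x 80 feature matrix before it reaches the network.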

Training

The model is pretrained on the VoxBlink2 and VoxCeleb2 datasets, then fine-tuned on the TidyVoiceX training set using large-margin training.

Performance

The baseline achieves the following performance on the TidyVoice development set:

Architecture	Pretraining Data	Fine-tuning Data	EER (%)	MinDCF
SimAM-ResNet34	VoxBlink2 + VoxCeleb2	TidyVoiceX Train	3.07	0.82

Usage

For the TidyVoice2026 Challenge: if you are using this model for the challenge, follow the GitHub repository README, which covers setup, data preparation, training, and evaluation.

Additional Resources

TidyVoice2026 Challenge README: Complete setup and usage guide for using this model in the TidyVoice2026 Challenge
GitHub Repository: WeSpeaker TidyVoice Baseline
Challenge Website: https://tidyvoice2026.github.io

Installation

First, install WeSpeaker:

pip install git+https://github.com/wenet-e2e/wespeaker.git

Or clone the repository:

git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .

Quick Start

Using WeSpeaker Python API

import wespeaker
import torch

# Download the model files (avg_model.pt and config.yaml) from Hugging Face
# into a local directory first, then point model_dir at that directory
model_dir = "path/to/downloaded/model"

# Initialize the model
model = wespeaker.load_model(model_dir)
model.set_device('cuda:0')  # or 'cpu'

# Extract speaker embedding from a single audio file
embedding = model.extract_embedding('audio.wav')
print(f"Embedding shape: {embedding.shape}")

# Compute similarity between two audio files
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
print(f"Similarity score: {similarity}")

# Extract embeddings from multiple files (Kaldi format)
utt_names, embeddings = model.extract_embedding_list('wav.scp')
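
If you extract embeddings yourself and want to compare them outside the library, plain cosine similarity is a common choice. The sketch below is a dependency-free illustration; the function name cosine_similarity is hypothetical, and the score returned by compute_similarity may be scaled differently, so treat the library's score range as something to verify rather than assume.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For real embeddings, convert each array to a flat list first (for example with `embedding.tolist()`), or use the equivalent vectorized operations in your array library.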

Using Command Line

# Extract embedding from a single audio file
wespeaker --task embedding \
    --audio_file audio.wav \
    --output_file embedding.txt \
    --pretrain path/to/model/directory

# Extract embeddings from wav.scp (Kaldi format)
wespeaker --task embedding_kaldi \
    --wav_scp wav.scp \
    --output_file embeddings.ark \
    --pretrain path/to/model/directory

# Compute similarity between two audio files
wespeaker --task similarity \
    --audio_file audio1.wav \
    --audio_file2 audio2.wav \
    --pretrain path/to/model/directory
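
The wav.scp file used above is a Kaldi-style list with one utterance per line: an utterance ID, a space, then the audio path. As a rough sketch, the helper below builds such a file from a folder of .wav files. The function name write_wav_scp and the ID scheme (relative path joined with underscores) are assumptions for illustration, not part of WeSpeaker.

```python
from pathlib import Path
from typing import Union


def write_wav_scp(audio_dir: Union[str, Path], scp_path: Union[str, Path]) -> int:
    """Write a Kaldi-style wav.scp for every .wav file under audio_dir.

    Returns the number of utterances written.
    """
    audio_dir = Path(audio_dir)
    lines = []
    for wav in sorted(audio_dir.rglob("*.wav")):
        # Build the ID from the relative path so files with the same name
        # in different subfolders stay distinct.
        utt_id = "_".join(wav.relative_to(audio_dir).with_suffix("").parts)
        if any(ch.isspace() for ch in utt_id):
            # Utterance IDs are whitespace-delimited, so spaces are not allowed.
            raise ValueError(f"utterance ID contains whitespace: {utt_id!r}")
        lines.append(f"{utt_id} {wav.resolve()}")
    text = "\n".join(lines) + ("\n" if lines else "")
    Path(scp_path).write_text(text, encoding="utf-8")
    return len(lines)
```

Calling `write_wav_scp("path/to/audio", "wav.scp")` produces a file that can be passed to the embedding_kaldi task shown above.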

Using WeSpeaker Training Scripts

If you're using the WeSpeaker training framework, you can load the model checkpoint directly:

from wespeaker.utils.checkpoint import load_checkpoint
from wespeaker.models.speaker_model import get_speaker_model
import torch
import yaml

# Load config
with open('config.yaml', 'r') as f:
    configs = yaml.safe_load(f)

# Initialize model
model = get_speaker_model(configs['model'])(**configs['model_args'])

# Load checkpoint
load_checkpoint(model, 'avg_model.pt')

# Set to evaluation mode
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Extract embeddings (see examples/tidyvoice/README.md for full pipeline)

Model Files

The model repository should contain:

avg_model.pt: The averaged model checkpoint (PyTorch format)
config.yaml: Model configuration file

Note: When using WeSpeaker's load_model() function, ensure the model directory contains both avg_model.pt and config.yaml files.
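
A quick pre-flight check can catch an incomplete download before the model is loaded. The snippet below is a small sketch that only inspects the directory; the helper name missing_model_files is hypothetical and not part of WeSpeaker.

```python
from pathlib import Path
from typing import List, Union

# File names taken from the list above.
REQUIRED_FILES = ("avg_model.pt", "config.yaml")


def missing_model_files(model_dir: Union[str, Path]) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
```

An empty list means both files are present; otherwise the returned names tell you what still needs to be downloaded.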

Dataset

This model is trained and evaluated on:

TidyVoiceX: A large-scale, multilingual corpus derived from Mozilla Common Voice
- Over 4,474 speakers across 40 languages
- Approximately 321,711 utterances totaling 457 hours
- Designed to isolate the effect of language switching

For more information about the dataset and challenge, visit: https://tidyvoice2026.github.io

Citation
