TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model

Model Description

This is the baseline model for the TidyVoice Challenge: Cross-Lingual Speaker Verification at Interspeech 2026. It targets speaker verification under language mismatch, where system performance degrades significantly when a speaker's enrollment and test utterances are in different languages.

Architecture

  • Model: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
  • Embedding Dimension: 256
  • Input: 80-dimensional log Mel-filterbank features
  • Sample Rate: 16 kHz

Training

The model is trained in two stages:

  1. Pretraining on the VoxBlink2 and VoxCeleb2 datasets
  2. Fine-tuning on the TidyVoiceX training set using large-margin training

Performance

The baseline achieves the following performance on the TidyVoice development set:

Architecture   | Pretraining Data      | Fine-tuning Data | EER (%) | MinDCF
SimAM-ResNet34 | VoxBlink2 + VoxCeleb2 | TidyVoiceX Train | 3.07    | 0.82
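
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how EER and minDCF are commonly computed from trial scores and target/non-target labels. It is not the challenge's official scoring script: the function names, the toy scores, and the minDCF parameters (`p_target=0.01`, `c_miss=1.0`, `c_fa=1.0`) are assumptions chosen for illustration, and the challenge may use different settings. Tied scores are not treated specially here.

```python
def error_rates(scores, labels):
    """Sweep the decision threshold over all scores.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns two lists: miss rates (FNR) and false-alarm rates (FPR).
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    fnr, fpr = [0.0], [1.0]  # threshold below every score: all trials accepted
    missed, rejected_non = 0, 0
    for _, label in pairs:
        # Move the threshold just above this trial, so it is now rejected.
        if label == 1:
            missed += 1
        else:
            rejected_non += 1
        fnr.append(missed / n_tar)
        fpr.append((n_non - rejected_non) / n_non)
    return fnr, fpr


def compute_eer(scores, labels):
    """Equal error rate: the point where FNR and FPR are closest."""
    fnr, fpr = error_rates(scores, labels)
    i = min(range(len(fnr)), key=lambda k: abs(fnr[k] - fpr[k]))
    return (fnr[i] + fpr[i]) / 2


def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost (parameter values are assumptions)."""
    fnr, fpr = error_rates(scores, labels)
    costs = [c_miss * p_target * m + c_fa * (1 - p_target) * f
             for m, f in zip(fnr, fpr)]
    # Normalize by the cost of the best system that ignores the audio.
    default_cost = min(c_miss * p_target, c_fa * (1 - p_target))
    return min(costs) / default_cost


# Toy trial list, not real challenge data.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(f"EER: {100 * compute_eer(toy_scores, toy_labels):.2f}%")
print(f"minDCF: {compute_min_dcf(toy_scores, toy_labels):.3f}")
```

The table reports EER as a percentage, so the fraction returned by `compute_eer` is multiplied by 100 before printing.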

Usage

For TidyVoice2026 Challenge participants: follow the detailed instructions in the GitHub repository README for complete setup, data preparation, training, and evaluation procedures.

Additional Resources

Installation

First, install WeSpeaker:

pip install git+https://github.com/wenet-e2e/wespeaker.git

Or clone the repository:

git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .

Quick Start

Using WeSpeaker Python API

import wespeaker

# Local directory holding the model files downloaded from Hugging Face
# (it must contain avg_model.pt and config.yaml)
model_dir = "path/to/downloaded/model"

# Initialize the model
model = wespeaker.load_model(model_dir)
model.set_device('cuda:0')  # or 'cpu'

# Extract speaker embedding from a single audio file
embedding = model.extract_embedding('audio.wav')
print(f"Embedding shape: {embedding.shape}")

# Compute similarity between two audio files
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
print(f"Similarity score: {similarity}")

# Extract embeddings from multiple files (Kaldi format)
utt_names, embeddings = model.extract_embedding_list('wav.scp')
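
To make the similarity score concrete, here is a minimal, dependency-free sketch of cosine similarity between two embedding vectors. The helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions; real embeddings from this model are 256-dimensional, and WeSpeaker's own `compute_similarity` may rescale or post-process the score differently.

```python
import math


def cosine_similarity(emb1, emb2):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(emb1) != len(emb2):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb1, emb2))
    norm1 = math.sqrt(sum(a * a for a in emb1))
    norm2 = math.sqrt(sum(b * b for b in emb2))
    return dot / (norm1 * norm2)


# Toy vectors standing in for 256-dimensional speaker embeddings.
enroll = [1.0, 2.0, 2.0]
test_same = [2.0, 4.0, 4.0]    # same direction as enroll
test_diff = [2.0, -1.0, 0.0]   # orthogonal to enroll

print(cosine_similarity(enroll, test_same))
print(cosine_similarity(enroll, test_diff))
```

Because cosine similarity ignores vector length, scaling an embedding does not change the score; only its direction matters.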

Using Command Line

# Extract embedding from a single audio file
wespeaker --task embedding \
    --audio_file audio.wav \
    --output_file embedding.txt \
    --pretrain path/to/model/directory

# Extract embeddings from wav.scp (Kaldi format)
wespeaker --task embedding_kaldi \
    --wav_scp wav.scp \
    --output_file embeddings.ark \
    --pretrain path/to/model/directory

# Compute similarity between two audio files
wespeaker --task similarity \
    --audio_file audio1.wav \
    --audio_file2 audio2.wav \
    --pretrain path/to/model/directory
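
The Kaldi-style wav.scp used by the embedding_kaldi task is a plain-text file with one "utterance_id path" pair per line. The sketch below builds such a file from a directory of WAV files. The helper name `write_wav_scp` and the choice of the file stem as utterance ID are assumptions for illustration; stems must be unique across the directory tree for the IDs to be usable.

```python
import tempfile
from pathlib import Path


def write_wav_scp(audio_dir, scp_path):
    """Write one '<utt_id> <absolute_wav_path>' line per .wav file under audio_dir.

    Returns the number of lines written.
    """
    wav_files = sorted(Path(audio_dir).rglob("*.wav"))
    lines = [f"{wav.stem} {wav.resolve()}\n" for wav in wav_files]
    Path(scp_path).write_text("".join(lines), encoding="utf-8")
    return len(lines)


# Demo on a temporary directory with empty placeholder files (not real audio).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("spk1_utt1.wav", "spk2_utt1.wav"):
        (Path(tmp) / name).touch()
    scp_file = Path(tmp) / "wav.scp"
    n_written = write_wav_scp(tmp, scp_file)
    scp_text = scp_file.read_text(encoding="utf-8")

print(scp_text)
```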

Using WeSpeaker Training Scripts

If you're using the WeSpeaker training framework, you can load the model checkpoint directly:

import torch
import yaml
from wespeaker.models.speaker_model import get_speaker_model
from wespeaker.utils.checkpoint import load_checkpoint

# Load config
with open('config.yaml', 'r') as f:
    configs = yaml.safe_load(f)

# Initialize model
model = get_speaker_model(configs['model'])(**configs['model_args'])

# Load checkpoint
load_checkpoint(model, 'avg_model.pt')

# Set to evaluation mode
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Extract embeddings (see examples/tidyvocie/README.md for full pipeline)

Model Files

The model repository should contain:

  • avg_model.pt: The averaged model checkpoint (PyTorch format)
  • config.yaml: Model configuration file

Note: When using WeSpeaker's load_model() function, ensure the model directory contains both avg_model.pt and config.yaml files.
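
A quick pre-flight check can catch a missing file before the model is loaded. This is a minimal sketch; the helper name `missing_model_files` is hypothetical and not part of WeSpeaker.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("avg_model.pt", "config.yaml")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo: a temporary directory that contains only config.yaml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.yaml").touch()
    missing = missing_model_files(tmp)

if missing:
    print("Missing model files:", ", ".join(missing))
```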

Dataset

This model is trained and evaluated on:

  • TidyVoiceX: A large-scale, multilingual corpus derived from Mozilla Common Voice
    • Over 4,474 speakers across 40 languages
    • Approximately 321,711 utterances totaling 457 hours
    • Designed to isolate the effect of language switching

For more information about the dataset and challenge, visit: https://tidyvoice2026.github.io

Citation
