---
license: apache-2.0
tags:
- speaker-verification
- speaker-embedding
- cross-lingual
- multilingual
- wespeaker
- resnet
- pytorch
datasets:
- voxblink2
- voxceleb2
- tidyvoicex
metrics:
- eer
- mindcf
---

# TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model

## Model Description

This is the baseline model for the **TidyVoice Challenge: Cross-Lingual Speaker Verification** at Interspeech 2026. The model addresses the critical problem of speaker verification under language mismatch, where performance degrades significantly when the enrollment and test utterances are spoken in different languages.

### Architecture

- **Model**: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
- **Embedding Dimension**: 256
- **Input**: 80-dimensional log Mel-filterbank features
- **Sample Rate**: 16 kHz

### Training

The model is trained in two stages:

1. **Pretrained** on the VoxBlink2 and VoxCeleb2 datasets
2. **Fine-tuned** on the TidyVoiceX training set using large-margin training

### Performance

The baseline achieves the following performance on the TidyVoice development set:

| Architecture | Pretraining Data | Fine-tuning Data | EER (%) | MinDCF |
|:---------------|:----------------------|:-----------------|:-------:|:------:|
| SimAM-ResNet34 | VoxBlink2 + VoxCeleb2 | TidyVoiceX Train | 3.07 | 0.82 |

## Usage

> **For the TidyVoice2026 Challenge**: If you are using this model for the TidyVoice2026 Challenge, please follow the detailed instructions in the [GitHub repository README](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) for complete setup, data preparation, training, and evaluation procedures.

## Additional Resources

- **TidyVoice2026 Challenge README**: [Complete setup and usage guide](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) - Follow this for detailed instructions on using this model for the TidyVoice2026 Challenge
- **GitHub Repository**: [WeSpeaker TidyVoice Baseline](https://github.com/areffarhadi/wespeaker/tree/master/examples/tidyvocie)
- **Challenge Website**: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)

### Installation

First, install WeSpeaker:

```bash
pip install git+https://github.com/wenet-e2e/wespeaker.git
```

Or clone the repository:

```bash
git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .
```
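The Quick Start below assumes `avg_model.pt` and `config.yaml` have already been downloaded to a local directory. If the files are fetched from the Hugging Face Hub, `huggingface_hub` can do this in one call; the snippet below is a minimal sketch, and the `repo_id` is a placeholder to replace with this model's actual Hub repository:

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# NOTE: the repo_id below is a placeholder, not the actual repository name.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="<org>/<tidyvoice2026-baseline>")
# model_dir now contains avg_model.pt and config.yaml and can be passed
# to wespeaker.load_model() as shown in the Quick Start below.
print(model_dir)
```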
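As background on the metrics in the performance table above: EER is the operating point at which the false-acceptance and false-rejection rates are equal, while MinDCF additionally weights misses and false alarms by application-dependent costs and target priors. The following numpy sketch illustrates one way to compute EER from trial scores and labels; it is illustrative only, not the official challenge scoring tool:

```python
# Illustrative EER computation from raw trial scores (numpy only).
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate from trial scores.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    order = np.argsort(scores)             # sweep thresholds in score order
    labels = labels[order].astype(float)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Rejecting everything at or below threshold index i:
    frr = np.cumsum(labels) / n_target                  # targets rejected
    far = 1.0 - np.cumsum(1.0 - labels) / n_nontarget   # non-targets accepted
    i = np.argmin(np.abs(far - frr))                    # point where FAR ~= FRR
    return float((far[i] + frr[i]) / 2.0)

# Example: compute_eer(np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0]))
# returns 0.0 because the target and non-target scores are perfectly separable.
```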
### Quick Start

#### Using the WeSpeaker Python API

```python
import wespeaker

# Load the model from Hugging Face.
# Download the model files (avg_model.pt and config.yaml) to a directory.
model_dir = "path/to/downloaded/model"

# Initialize the model
model = wespeaker.load_model(model_dir)
model.set_device('cuda:0')  # or 'cpu'

# Extract a speaker embedding from a single audio file
embedding = model.extract_embedding('audio.wav')
print(f"Embedding shape: {embedding.shape}")

# Compute the similarity between two audio files
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
print(f"Similarity score: {similarity}")

# Extract embeddings from multiple files (Kaldi format)
utt_names, embeddings = model.extract_embedding_list('wav.scp')
```

#### Using the Command Line

```bash
# Extract an embedding from a single audio file
wespeaker --task embedding \
    --audio_file audio.wav \
    --output_file embedding.txt \
    --pretrain path/to/model/directory

# Extract embeddings from wav.scp (Kaldi format)
wespeaker --task embedding_kaldi \
    --wav_scp wav.scp \
    --output_file embeddings.ark \
    --pretrain path/to/model/directory

# Compute the similarity between two audio files
wespeaker --task similarity \
    --audio_file audio1.wav \
    --audio_file2 audio2.wav \
    --pretrain path/to/model/directory
```

#### Using WeSpeaker Training Scripts

If you are using the WeSpeaker training framework, you can load the model checkpoint directly:

```python
import torch
import yaml

from wespeaker.models.speaker_model import get_speaker_model
from wespeaker.utils.checkpoint import load_checkpoint

# Load the config
with open('config.yaml', 'r') as f:
    configs = yaml.safe_load(f)

# Initialize the model
model = get_speaker_model(configs['model'])(**configs['model_args'])

# Load the checkpoint
load_checkpoint(model, 'avg_model.pt')

# Set to evaluation mode
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Extract embeddings (see examples/tidyvocie/README.md for the full pipeline)
```

### Model Files

The model repository contains:

- `avg_model.pt`: The averaged model checkpoint (PyTorch format)
- `config.yaml`: Model configuration file

**Note**: When using WeSpeaker's `load_model()` function, ensure the model directory contains both the `avg_model.pt` and `config.yaml` files.

## Dataset

This model is trained and evaluated on:

- **TidyVoiceX**: A large-scale multilingual corpus derived from Mozilla Common Voice
  - 4,474 speakers across 40 languages
  - 321,711 utterances totaling approximately 457 hours
  - Designed to isolate the effect of language switching

For more information about the dataset and challenge, visit [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io).

## Citation