areffarhadi/Resnet34-tidyvoiceX-ASV

+---
+license: apache-2.0
+tags:
+- speaker-verification
+- speaker-embedding
+- cross-lingual
+- multilingual
+- wespeaker
+- resnet
+- pytorch
+datasets:
+- voxblink2
+- voxceleb2
+- tidyvoicex
+metrics:
+- eer
+- mindcf
+---
+# TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model
+## Model Description
+This is the baseline model for the **TidyVoice Challenge: Cross-Lingual Speaker Verification** at Interspeech 2026. It targets speaker verification under language mismatch, where performance degrades significantly when the same speaker is compared across utterances in different languages.
+### Architecture
+- **Model**: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
+- **Embedding Dimension**: 256
+- **Input**: 80-dimensional log Mel-filterbank features
+- **Sample Rate**: 16 kHz
+### Training
+The model is trained in two stages:
+1. **Pretraining** on the VoxBlink2 and VoxCeleb2 datasets
+2. **Fine-tuning** on the TidyVoiceX training set using large-margin training
+### Performance
+The baseline achieves the following performance on the TidyVoice development set:
+| Architecture | Pretraining Data | Fine-tuning Data | EER (%) | MinDCF |
+|:-------------|:----------------|:----------------|:-------:|:------:|
+| SimAM-ResNet34 | VoxBlink2 + VoxCeleb2 | TidyVoiceX Train | 3.07 | 0.82 |
+## Usage
+> **For the TidyVoice2026 Challenge**: Follow the detailed instructions in the [GitHub repository README](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) for complete setup, data preparation, training, and evaluation procedures.
+### Installation
+First, install WeSpeaker:
+```bash
+pip install git+https://github.com/wenet-e2e/wespeaker.git
+```
+Or clone the repository:
+```bash
+git clone https://github.com/wenet-e2e/wespeaker.git
+cd wespeaker
+pip install -e .
+```
+### Quick Start
+#### Using WeSpeaker Python API
+```python
+import wespeaker
+import torch
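+# Download avg_model.pt and config.yaml from this Hugging Face repository
+# into a local directory, then point model_dir at that directory
+model_dir = "path/to/downloaded/model"
+# Initialize the model. If load_model() in your WeSpeaker version only accepts
+# a language name, try wespeaker.load_model_local(model_dir) instead.
+model = wespeaker.load_model(model_dir)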
+model.set_device('cuda:0')  # or 'cpu'
+# Extract speaker embedding from a single audio file
+embedding = model.extract_embedding('audio.wav')
+print(f"Embedding shape: {embedding.shape}")
+# Compute similarity between two audio files
+similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
+print(f"Similarity score: {similarity}")
+# Extract embeddings from multiple files (Kaldi format)
+utt_names, embeddings = model.extract_embedding_list('wav.scp')
+```
+#### Using Command Line
+```bash
+# Extract embedding from a single audio file
+wespeaker --task embedding \
+    --audio_file audio.wav \
+    --output_file embedding.txt \
+    --pretrain path/to/model/directory
+# Extract embeddings from wav.scp (Kaldi format)
+wespeaker --task embedding_kaldi \
+    --wav_scp wav.scp \
+    --output_file embeddings.ark \
+    --pretrain path/to/model/directory
+# Compute similarity between two audio files
+wespeaker --task similarity \
+    --audio_file audio1.wav \
+    --audio_file2 audio2.wav \
+    --pretrain path/to/model/directory
+```
+#### Using WeSpeaker Training Scripts
+If you're using the WeSpeaker training framework, you can load the model checkpoint directly:
+```python
+from wespeaker.utils.checkpoint import load_checkpoint
+from wespeaker.models.speaker_model import get_speaker_model
+import torch
+import yaml
+# Load config
+with open('config.yaml', 'r') as f:
+    configs = yaml.safe_load(f)
+# Initialize model
+model = get_speaker_model(configs['model'])(**configs['model_args'])
+# Load checkpoint
+load_checkpoint(model, 'avg_model.pt')
+# Set to evaluation mode
+model.eval()
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+model.to(device)
+# Extract embeddings (see examples/tidyvocie/README.md for full pipeline)
+```
+### Model Files
+The model repository should contain:
+- `avg_model.pt`: The averaged model checkpoint (PyTorch format)
+- `config.yaml`: Model configuration file
+**Note**: When using WeSpeaker's `load_model()` function, ensure the model directory contains both `avg_model.pt` and `config.yaml` files.
+## Dataset
+This model is trained and evaluated on:
+- **TidyVoiceX**: A large-scale, multilingual corpus derived from Mozilla Common Voice
+  - Over 4,474 speakers across 40 languages
+  - Approximately 321,711 utterances totaling 457 hours
+  - Designed to isolate the effect of language switching
+For more information about the dataset and challenge, visit: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@inproceedings{tidyvoice2026,
+  title={TidyVoice Challenge: Cross-Lingual Speaker Verification},
+  author={...},
+  booktitle={Interspeech},
+  year={2026}
+}
+```
+## Additional Resources
+- **TidyVoice2026 Challenge README**: [Complete setup and usage guide](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) with detailed instructions for using this model in the challenge
+- **GitHub Repository**: [WeSpeaker TidyVoice Baseline](https://github.com/wenet-e2e/wespeaker/tree/master/examples/tidyvocie)
+- **Challenge Website**: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)
+- **WeSpeaker Documentation**: [https://github.com/wenet-e2e/wespeaker](https://github.com/wenet-e2e/wespeaker)
+## Contact
+For questions about the challenge or this baseline:
+- **Aref Farhadipour**: [email protected]
+- **Challenge Website**: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)
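
The README above reports an EER and shows how to obtain a similarity score for a pair of files, but not how the two relate. The following is a minimal sketch, not the challenge's official scoring script: it assumes embeddings are plain sequences of floats, that trials are scored with cosine similarity, and that EER is taken at the threshold where false-accept and false-reject rates are closest. The function names `cosine_score` and `compute_eer` and the toy scores are hypothetical. Whether WeSpeaker's `compute_similarity` applies any extra normalization is not stated in the README.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (sequences of floats)."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    return dot / (norm_a * norm_b)


def compute_eer(scores, labels):
    """Approximate EER for trial scores; labels are 1 (same speaker) or 0."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    # Sweep the threshold downwards by accepting trials in descending score order.
    trials = sorted(zip(scores, labels), key=lambda t: -t[0])
    accepted_target = accepted_nontarget = 0
    best_gap, eer = 1.0, 0.5  # state before any trial is accepted: FAR=0, FRR=1
    for _, label in trials:
        if label == 1:
            accepted_target += 1
        else:
            accepted_nontarget += 1
        far = accepted_nontarget / n_nontarget   # false-accept rate
        frr = 1.0 - accepted_target / n_target   # false-reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy trials with made-up scores, for illustration only.
toy_scores = [0.9, 0.6, 0.7, 0.1]
toy_labels = [1, 1, 0, 0]
print(f"EER: {100 * compute_eer(toy_scores, toy_labels):.2f}%")
```

This sketch does not interpolate between thresholds or treat tied scores specially, so its output can differ slightly from an evaluation toolkit on real trial lists.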

avg_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c8fdfd9a657489ad467d3a403c617a9ddfb028204e77c39e1303c79782f13f3a
+size 104756586
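
The pointer above records the SHA-256 digest and byte size of the real checkpoint, so a downloaded `avg_model.pt` can be checked against it. A minimal sketch using only the standard library follows; the helper names `file_sha256` and `matches_pointer` are hypothetical, and the local path in the usage note is an assumption.

```python
import hashlib
import os

# Values copied from the Git LFS pointer for avg_model.pt.
EXPECTED_SHA256 = "c8fdfd9a657489ad467d3a403c617a9ddfb028204e77c39e1303c79782f13f3a"
EXPECTED_SIZE = 104756586


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path, sha256=EXPECTED_SHA256, size=EXPECTED_SIZE):
    """Return True if the file's size and SHA-256 digest match the pointer."""
    return os.path.getsize(path) == size and file_sha256(path) == sha256
```

For example, `matches_pointer("path/to/downloaded/model/avg_model.pt")` should return `True` for an intact download and `False` if only the small pointer file was fetched.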

config.yaml ADDED

	@@ -0,0 +1,84 @@

+data_type: shard
+dataloader_args:
+  batch_size: 24
+  drop_last: true
+  num_workers: 16
+  pin_memory: false
+  prefetch_factor: 8
+dataset_args:
+  aug_prob: 0.3
+  fbank_args:
+    dither: 1.0
+    frame_length: 25
+    frame_shift: 10
+    num_mel_bins: 80
+  filter: true
+  filter_args:
+    max_num_frames: 800
+    min_num_frames: 200
+  num_frms: 600
+  resample_rate: 16000
+  sample_num_per_epoch: 0
+  shuffle: true
+  shuffle_args:
+    shuffle_size: 2500
+  spec_aug: false
+  spec_aug_args:
+    max_f: 8
+    max_t: 10
+    num_f_mask: 1
+    num_t_mask: 1
+    prob: 0.6
+  speed_perturb: false
+do_lm: true
+enable_amp: false
+exp_dir: exp/samresnet34_voxblink_ft_tidy
+gpus:
+- 4
+- 5
+log_batch_interval: 100
+loss: CrossEntropyLoss
+loss_args: {}
+margin_scheduler: MarginScheduler
+margin_update:
+  epoch_iter: 5463
+  final_margin: 0.3
+  fix_start_epoch: 3
+  increase_start_epoch: 0
+  increase_type: linear
+  initial_margin: 0.0
+  update_margin: true
+model: SimAM_ResNet34_ASP
+model_args:
+  embed_dim: 256
+model_init: tidy/avg_model.pt
+noise_data: data/musan/lmdb
+num_avg: 1
+num_epochs: 7
+optimizer: SGD
+optimizer_args:
+  lr: 5.0e-05
+  momentum: 0.9
+  nesterov: true
+  weight_decay: 0.0001
+projection_args:
+  do_lm: true
+  easy_margin: false
+  embed_dim: 256
+  num_class: 3666
+  project_type: arc_margin
+  scale: 32.0
+reverb_data: data/rirs/lmdb
+save_epoch_interval: 1
+scheduler: ExponentialDecrease
+scheduler_args:
+  epoch_iter: 5463
+  final_lr: 1.0e-05
+  initial_lr: 5.0e-05
+  num_epochs: 7
+  scale_ratio: 0.75
+  warm_from_zero: false
+  warm_up_epoch: 0
+seed: 42
+train_data: /local/scratch/arfarh/wespeaker/wespeaker/examples/voxceleb/v2/data/vox2_dev/shard.list
+train_label: /local/scratch/arfarh/wespeaker/wespeaker/examples/voxceleb/v2/data/vox2_dev/utt2spk
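
The `margin_update` block in the config above describes a margin that grows linearly from `initial_margin` to `final_margin` and is then held fixed. The sketch below illustrates that shape using the config values as defaults. It assumes epochs are counted from zero and that the ramp runs from `increase_start_epoch` to `fix_start_epoch`; WeSpeaker's own `MarginScheduler` may index epochs differently, so treat `linear_margin` as a hypothetical illustration rather than the library's implementation.

```python
def linear_margin(iteration, initial_margin=0.0, final_margin=0.3,
                  increase_start_epoch=0, fix_start_epoch=3, epoch_iter=5463):
    """Margin at a given global iteration under a linear ramp (assumed indexing)."""
    start_iter = increase_start_epoch * epoch_iter
    fix_iter = fix_start_epoch * epoch_iter
    if iteration <= start_iter:
        return initial_margin          # ramp has not started yet
    if iteration >= fix_iter:
        return final_margin            # margin is held fixed from here on
    progress = (iteration - start_iter) / (fix_iter - start_iter)
    return initial_margin + (final_margin - initial_margin) * progress


# Margin at the start of each of the 7 configured epochs (num_epochs: 7).
for epoch in range(7):
    print(epoch, round(linear_margin(epoch * 5463), 3))
```

Under these assumptions the margin reaches 0.3 after three epochs and stays there for the remaining four.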