areffarhadi committed
Commit 1a56028 · verified · 1 Parent(s): 8165ea0

Initial upload
Files changed (3)
  1. README.md +188 -0
  2. avg_model.pt +3 -0
  3. config.yaml +84 -0
README.md ADDED
@@ -0,0 +1,188 @@
+ ---
+ license: apache-2.0
+ tags:
+ - speaker-verification
+ - speaker-embedding
+ - cross-lingual
+ - multilingual
+ - wespeaker
+ - resnet
+ - pytorch
+ datasets:
+ - voxblink2
+ - voxceleb2
+ - tidyvoicex
+ metrics:
+ - eer
+ - mindcf
+ ---
+
+ # TidyVoice2026 Baseline: SimAM-ResNet34 Speaker Verification Model
+
+ ## Model Description
+
+ This is the baseline model for the **TidyVoice Challenge: Cross-Lingual Speaker Verification** at Interspeech 2026. The model addresses the critical problem of speaker verification under language mismatch, where system performance degrades significantly when enrollment and test utterances are spoken in different languages.
+
+ ### Architecture
+
+ - **Model**: SimAM-ResNet34 with Attentive Statistical Pooling (ASP)
+ - **Embedding Dimension**: 256
+ - **Input**: 80-dimensional log Mel-filterbank features (see the feature-extraction sketch below)
+ - **Sample Rate**: 16 kHz
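+
+ For reference, input features matching this spec can be computed with `torchaudio`'s Kaldi-compatible frontend. This is an illustrative sketch mirroring the `fbank_args` in `config.yaml`, not the only valid frontend; the per-utterance mean normalization shown at the end reproduces what WeSpeaker applies at inference time and is stated here as an assumption:
+
+ ```python
+ import torchaudio
+ import torchaudio.compliance.kaldi as kaldi
+
+ # Load a 16 kHz mono waveform
+ wav, sr = torchaudio.load('audio.wav')
+ assert sr == 16000, "the model expects 16 kHz input"
+
+ # 80-dim log Mel-filterbank features: 25 ms frames, 10 ms shift,
+ # matching fbank_args in config.yaml (dither disabled for inference)
+ feats = kaldi.fbank(wav, num_mel_bins=80, frame_length=25.0,
+                     frame_shift=10.0, dither=0.0, sample_frequency=16000.0)
+ feats = feats - feats.mean(dim=0, keepdim=True)  # cepstral mean normalization
+ ```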
+
+ ### Training
+
+ The model is trained in two stages:
+ 1. **Pretraining** on the VoxBlink2 and VoxCeleb2 datasets
+ 2. **Fine-tuning** on the TidyVoiceX training set using large-margin training (see the margin-schedule sketch below)
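+
+ "Large-margin" here refers to the arc-margin (ArcFace-style) projection head configured in `config.yaml`, whose margin is scheduled: it increases linearly from 0.0 to 0.3 and is held fixed from epoch 3 onward. Below is a minimal sketch of that schedule; it is illustrative only, and WeSpeaker's `MarginScheduler` is the authoritative implementation:
+
+ ```python
+ def margin_at(epoch: float, initial: float = 0.0, final: float = 0.3,
+               increase_start: int = 0, fix_start: int = 3) -> float:
+     """Linear margin warm-up mirroring margin_update in config.yaml."""
+     if epoch >= fix_start:
+         return final
+     if epoch <= increase_start:
+         return initial
+     frac = (epoch - increase_start) / (fix_start - increase_start)
+     return initial + frac * (final - initial)
+
+ print([round(margin_at(e), 2) for e in range(5)])  # [0.0, 0.1, 0.2, 0.3, 0.3]
+ ```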
+
+ ### Performance
+
+ The baseline achieves the following performance on the TidyVoice development set:
+
+ | Architecture | Pretraining Data | Fine-tuning Data | EER (%) | MinDCF |
+ |:-------------|:----------------|:----------------|:-------:|:------:|
+ | SimAM-ResNet34 | VoxBlink2 + VoxCeleb2 | TidyVoiceX Train | 3.07 | 0.82 |
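+
+ For reference, EER and minDCF can be computed from raw trial scores and labels as sketched below. The DCF parameters (`p_target`, `c_miss`, `c_fa`) are assumptions here; the challenge evaluation plan defines the official values.
+
+ ```python
+ import numpy as np
+
+ def eer_and_mindcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
+     """EER and normalized minDCF; labels are 1 for target trials, 0 otherwise."""
+     scores = np.asarray(scores, dtype=float)
+     labels = np.asarray(labels, dtype=int)
+     order = np.argsort(scores)                   # sweep every score as a threshold
+     labels = labels[order]
+     n_tgt, n_non = labels.sum(), (1 - labels).sum()
+     p_miss = np.cumsum(labels) / n_tgt           # targets rejected at each threshold
+     p_fa = 1.0 - np.cumsum(1 - labels) / n_non   # non-targets still accepted
+     i = np.argmin(np.abs(p_miss - p_fa))
+     eer = (p_miss[i] + p_fa[i]) / 2
+     dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
+     return eer, dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
+ ```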
+
+ ## Usage
+
+ > **TidyVoice2026 Challenge participants**: please follow the detailed instructions in the [GitHub repository README](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md) for complete setup, data preparation, training, and evaluation procedures.
+
+ ### Installation
+
+ First, install WeSpeaker:
+
+ ```bash
+ pip install git+https://github.com/wenet-e2e/wespeaker.git
+ ```
+
+ Or clone the repository:
+
+ ```bash
+ git clone https://github.com/wenet-e2e/wespeaker.git
+ cd wespeaker
+ pip install -e .
+ ```
+
+ ### Quick Start
+
+ #### Using the WeSpeaker Python API
+
+ ```python
+ import wespeaker
+
+ # Download the model files (avg_model.pt and config.yaml) from
+ # Hugging Face to a local directory
+ model_dir = "path/to/downloaded/model"
+
+ # Initialize the model
+ model = wespeaker.load_model(model_dir)
+ model.set_device('cuda:0')  # or 'cpu'
+
+ # Extract a speaker embedding from a single audio file
+ embedding = model.extract_embedding('audio.wav')
+ print(f"Embedding shape: {embedding.shape}")
+
+ # Compute the similarity between two audio files
+ similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
+ print(f"Similarity score: {similarity}")
+
+ # Extract embeddings for multiple files (Kaldi-style wav.scp)
+ utt_names, embeddings = model.extract_embedding_list('wav.scp')
+ ```
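+
+ Building on `extract_embedding_list`, a full trial list can then be scored with plain cosine similarity. This is a sketch under assumptions: the `trials.txt` format (`utt1 utt2` per line) is hypothetical, and WeSpeaker's own scoring scripts may apply additional normalization:
+
+ ```python
+ import numpy as np
+
+ utt_names, embeddings = model.extract_embedding_list('wav.scp')
+ # np.asarray also accepts CPU torch tensors, should the API return those
+ emb = {name: np.asarray(e) for name, e in zip(utt_names, embeddings)}
+
+ def cosine(a, b):
+     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+ # Hypothetical trial format: one "utt1 utt2" pair per line
+ with open('trials.txt') as f:
+     for line in f:
+         u1, u2 = line.split()[:2]
+         print(u1, u2, cosine(emb[u1], emb[u2]))
+ ```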
+
+ #### Using the Command Line
+
+ ```bash
+ # Extract an embedding from a single audio file
+ wespeaker --task embedding \
+   --audio_file audio.wav \
+   --output_file embedding.txt \
+   --pretrain path/to/model/directory
+
+ # Extract embeddings from wav.scp (Kaldi format)
+ wespeaker --task embedding_kaldi \
+   --wav_scp wav.scp \
+   --output_file embeddings.ark \
+   --pretrain path/to/model/directory
+
+ # Compute the similarity between two audio files
+ wespeaker --task similarity \
+   --audio_file audio1.wav \
+   --audio_file2 audio2.wav \
+   --pretrain path/to/model/directory
+ ```
+
+ #### Using WeSpeaker Training Scripts
+
+ If you are using the WeSpeaker training framework, you can load the model checkpoint directly:
+
+ ```python
+ import torch
+ import yaml
+
+ from wespeaker.models.speaker_model import get_speaker_model
+ from wespeaker.utils.checkpoint import load_checkpoint
+
+ # Load the config
+ with open('config.yaml', 'r') as f:
+     configs = yaml.safe_load(f)
+
+ # Initialize the model
+ model = get_speaker_model(configs['model'])(**configs['model_args'])
+
+ # Load the checkpoint
+ load_checkpoint(model, 'avg_model.pt')
+
+ # Set to evaluation mode and move to the target device
+ model.eval()
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ model.to(device)
+
+ # Extract embeddings (see examples/tidyvocie/README.md for the full pipeline)
+ ```
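+
+ To round out the snippet above, here is a hedged sketch of the forward pass WeSpeaker performs internally when extracting an embedding (features computed as in the earlier fbank example; the tuple handling mirrors WeSpeaker's extraction code, since some of its models return more than one embedding):
+
+ ```python
+ import torchaudio
+ import torchaudio.compliance.kaldi as kaldi
+
+ wav, sr = torchaudio.load('audio.wav')  # expects 16 kHz mono
+ feats = kaldi.fbank(wav, num_mel_bins=80, frame_length=25.0,
+                     frame_shift=10.0, dither=0.0, sample_frequency=16000.0)
+ feats = feats - feats.mean(dim=0, keepdim=True)  # mean normalization
+
+ with torch.no_grad():
+     outputs = model(feats.unsqueeze(0).to(device))  # input: (1, T, 80)
+     embedding = outputs[-1] if isinstance(outputs, tuple) else outputs
+ print(embedding.shape)  # expected: (1, 256)
+ ```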
+
+ ### Model Files
+
+ The model repository contains:
+ - `avg_model.pt`: the averaged model checkpoint (PyTorch format)
+ - `config.yaml`: the model configuration file
+
+ **Note**: When using WeSpeaker's `load_model()` function, ensure the model directory contains both `avg_model.pt` and `config.yaml`.
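+
+ If the files are hosted on the Hugging Face Hub, `huggingface_hub.snapshot_download` can fetch them into a local directory. The repo id below is a placeholder, not this model's confirmed id:
+
+ ```python
+ from huggingface_hub import snapshot_download
+ import wespeaker
+
+ # Placeholder repo id -- substitute this model's actual Hub id
+ model_dir = snapshot_download(repo_id="<user>/<model-repo>")
+ model = wespeaker.load_model(model_dir)
+ ```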
+
+ ## Dataset
+
+ This model is trained and evaluated on:
+ - **TidyVoiceX**: a large-scale, multilingual corpus derived from Mozilla Common Voice
+   - Over 4,474 speakers across 40 languages
+   - Approximately 321,711 utterances totaling 457 hours
+   - Designed to isolate the effect of language switching
+
+ For more information about the dataset and challenge, visit [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io).
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @inproceedings{tidyvoice2026,
+   title={TidyVoice Challenge: Cross-Lingual Speaker Verification},
+   author={...},
+   booktitle={Interspeech},
+   year={2026}
+ }
+ ```
+
+ ## Additional Resources
+
+ - **TidyVoice2026 Challenge README**: [complete setup and usage guide](https://github.com/areffarhadi/wespeaker/blob/master/examples/tidyvocie/README.md); follow this for detailed instructions on using this model in the TidyVoice2026 Challenge
+ - **GitHub Repository**: [WeSpeaker TidyVoice Baseline](https://github.com/wenet-e2e/wespeaker/tree/master/examples/tidyvocie)
+ - **Challenge Website**: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)
+ - **WeSpeaker Documentation**: [https://github.com/wenet-e2e/wespeaker](https://github.com/wenet-e2e/wespeaker)
+
+ ## Contact
+
+ For questions about the challenge or this baseline:
+ - **Aref Farhadipour**: [email protected]
+ - **Challenge Website**: [https://tidyvoice2026.github.io](https://tidyvoice2026.github.io)
+
avg_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c8fdfd9a657489ad467d3a403c617a9ddfb028204e77c39e1303c79782f13f3a
+ size 104756586
config.yaml ADDED
@@ -0,0 +1,84 @@
+ data_type: shard
+ dataloader_args:
+   batch_size: 24
+   drop_last: true
+   num_workers: 16
+   pin_memory: false
+   prefetch_factor: 8
+ dataset_args:
+   aug_prob: 0.3
+   fbank_args:
+     dither: 1.0
+     frame_length: 25
+     frame_shift: 10
+     num_mel_bins: 80
+   filter: true
+   filter_args:
+     max_num_frames: 800
+     min_num_frames: 200
+   num_frms: 600
+   resample_rate: 16000
+   sample_num_per_epoch: 0
+   shuffle: true
+   shuffle_args:
+     shuffle_size: 2500
+   spec_aug: false
+   spec_aug_args:
+     max_f: 8
+     max_t: 10
+     num_f_mask: 1
+     num_t_mask: 1
+     prob: 0.6
+   speed_perturb: false
+ do_lm: true
+ enable_amp: false
+ exp_dir: exp/samresnet34_voxblink_ft_tidy
+ gpus:
+ - 4
+ - 5
+ log_batch_interval: 100
+ loss: CrossEntropyLoss
+ loss_args: {}
+ margin_scheduler: MarginScheduler
+ margin_update:
+   epoch_iter: 5463
+   final_margin: 0.3
+   fix_start_epoch: 3
+   increase_start_epoch: 0
+   increase_type: linear
+   initial_margin: 0.0
+   update_margin: true
+ model: SimAM_ResNet34_ASP
+ model_args:
+   embed_dim: 256
+ model_init: tidy/avg_model.pt
+ noise_data: data/musan/lmdb
+ num_avg: 1
+ num_epochs: 7
+ optimizer: SGD
+ optimizer_args:
+   lr: 5.0e-05
+   momentum: 0.9
+   nesterov: true
+   weight_decay: 0.0001
+ projection_args:
+   do_lm: true
+   easy_margin: false
+   embed_dim: 256
+   num_class: 3666
+   project_type: arc_margin
+   scale: 32.0
+ reverb_data: data/rirs/lmdb
+ save_epoch_interval: 1
+ scheduler: ExponentialDecrease
+ scheduler_args:
+   epoch_iter: 5463
+   final_lr: 1.0e-05
+   initial_lr: 5.0e-05
+   num_epochs: 7
+   scale_ratio: 0.75
+   warm_from_zero: false
+   warm_up_epoch: 0
+ seed: 42
+ train_data: /local/scratch/arfarh/wespeaker/wespeaker/examples/voxceleb/v2/data/vox2_dev/shard.list
+ train_label: /local/scratch/arfarh/wespeaker/wespeaker/examples/voxceleb/v2/data/vox2_dev/utt2spk