Model description

This self-supervised speech model, SSA-HuBERT-base-60k-V2, is based on the HuBERT Base architecture (~95M parameters) [1].
It was trained on nearly 60,000 hours of speech segments and covers 21 languages and language variants spoken in Sub-Saharan Africa.

Pretraining data

  • Dataset: The training dataset was composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).

  • Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

ASR fine-tuning

The model is fine-tuned with the SpeechBrain toolkit (Ravanelli et al., 2021).
Fine-tuning is done separately for each language on the FLEURS dataset [2].
The pretrained model (SSA-HuBERT-base-60k) serves as the speech encoder and is fully fine-tuned together with two 1024-unit linear layers and a softmax output layer on top.
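
To give a rough sense of how small this head is next to the encoder, the sketch below counts its parameters. It assumes that the encoder emits 768-dimensional frames (the HuBERT Base hidden size), that each linear layer has 1024 units with a bias term, and that the output layer projects to a hypothetical vocabulary of 40 characters. The real vocabulary size depends on the language and is not stated here, and any normalization or dropout layers in the actual recipe are ignored.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Trainable parameters of a dense layer mapping n_in -> n_out."""
    return n_in * n_out + (n_out if bias else 0)


def head_params(encoder_dim: int, hidden_dim: int, vocab_size: int) -> int:
    """Parameters of two hidden linear layers plus the output projection."""
    return (
        linear_params(encoder_dim, hidden_dim)    # first 1024-unit layer
        + linear_params(hidden_dim, hidden_dim)   # second 1024-unit layer
        + linear_params(hidden_dim, vocab_size)   # projection fed to the softmax
    )


ENCODER_DIM = 768             # HuBERT Base hidden size
HIDDEN_DIM = 1024             # width of the two linear layers
VOCAB_SIZE = 40               # hypothetical character vocabulary (assumption)
ENCODER_PARAMS = 95_000_000   # approximate size of the encoder

total = head_params(ENCODER_DIM, HIDDEN_DIM, VOCAB_SIZE)
print(total)                              # 1878056
print(f"{total / ENCODER_PARAMS:.1%}")    # share relative to the encoder
```

Under these assumptions the head adds roughly 2% to the parameter count, so almost all of the capacity being fine-tuned sits in the encoder itself.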

License

This model is released under the CC BY-NC 4.0 license.

Results

The following results were obtained with greedy decoding (no language model rescoring).
Character error rates (CERs) and word error rates (WERs) are given in the table below for the 20 languages of the Sub-Saharan Africa (SSA) subset of the FLEURS dataset.
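
As an illustration of what these terms mean in practice, here is a minimal sketch of greedy decoding followed by CER and WER scoring. It assumes a CTC-style output with a blank symbol at index 0, which this card does not confirm, and it counts spaces as characters when computing CER. The function names and the toy label set are hypothetical and are not taken from the SpeechBrain recipe.

```python
from typing import List, Sequence


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      labels: Sequence[str], blank: int = 0) -> str:
    """Take the best label per frame, merge repeats, then drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    out: List[str] = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(labels[idx])
        prev = idx
    return "".join(out)


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance as a percentage of the reference length (must be > 0)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    return error_rate(list(ref), list(hyp))


# Toy example: six frames over the labels <blank>, "a", "b".
labels = ["<blank>", "a", "b"]
scores = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05],
          [0.1, 0.6, 0.3], [0.1, 0.2, 0.7], [0.0, 0.1, 0.9]]
print(greedy_ctc_decode(scores, labels))                 # aab
print(wer("the cat sat on mat", "the cat sit on mat"))   # 20.0
print(cer("abcd", "abxd"))                               # 25.0
```

Because no language model rescores the hypotheses, the reported numbers reflect the acoustic model alone. Text normalization choices (casing, punctuation, whether spaces count toward CER) vary between toolkits and can shift both metrics.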

Language           CER (%)                   WER (%)
                   base-V2   large   XL      base-V2   large   XL
Afrikaans          19.8      13.0    12.4    59.1      42.3    39.8
Amharic            13.3      9.9     10.3    44.3      32.9    34.3
Fula               16.8      15.4    16.4    54.2      50.9    52.7
Ganda              10.3      9.4     9.0     49.4      46.9    45.6
Hausa              8.5       6.6     5.5     28.1      21.6    19.6
Igbo               15.8      13.2    12.8    49.7      44.2    43.3
Kamba              14.5      11.4    10.7    50.2      41.8    39.7
Lingala            6.9       4.9     4.3     20.4      14.9    13.6
Luo                7.6       6.1     5.8     33.6      28.0    27.0
Northern Sotho     10.7      8.4     8.0     35.9      28.8    33.7
Nyanja             10.6      8.0     7.0     44.5      35.3    32.7
Oromo              19.4      18.2    18.3    73.1      66.9    67.7
Shona              7.3       5.1     4.7     34.6      24.6    23.2
Somali             19.1      15.5    15.3    58.6      49.8    49.2
Swahili            4.8       3.3     2.7     17.6      12.0    10.1
Umbundu            18.3      15.1    14.6    53.7      47.7    50.6
Wolof              16.3      13.7    12.4    48.7      42.2    40.0
Xhosa              8.9       6.7     6.3     42.2      34.6    33.5
Yoruba             21.6      19.9    19.0    62.2      57.9    55.9
Zulu               9.1       6.7     6.2     42.1      33.3    31.0
Overall average    13.0      10.5    10.1    45.1      37.8    37.2
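
The "Overall average" row appears to be an unweighted mean over the 20 languages. The short check below reproduces it for the base-V2 columns. The figures are copied from the table, and the unweighted-mean interpretation is an assumption that happens to match the reported values.

```python
# (CER, WER) for the base-V2 model, copied from the table above.
base_v2 = {
    "Afrikaans": (19.8, 59.1), "Amharic": (13.3, 44.3), "Fula": (16.8, 54.2),
    "Ganda": (10.3, 49.4), "Hausa": (8.5, 28.1), "Igbo": (15.8, 49.7),
    "Kamba": (14.5, 50.2), "Lingala": (6.9, 20.4), "Luo": (7.6, 33.6),
    "Northern Sotho": (10.7, 35.9), "Nyanja": (10.6, 44.5),
    "Oromo": (19.4, 73.1), "Shona": (7.3, 34.6), "Somali": (19.1, 58.6),
    "Swahili": (4.8, 17.6), "Umbundu": (18.3, 53.7), "Wolof": (16.3, 48.7),
    "Xhosa": (8.9, 42.2), "Yoruba": (21.6, 62.2), "Zulu": (9.1, 42.1),
}


def mean_1dp(values):
    """Unweighted mean rounded to one decimal place."""
    values = list(values)
    return round(sum(values) / len(values), 1)


avg_cer = mean_1dp(cer for cer, _ in base_v2.values())
avg_wer = mean_1dp(wer for _, wer in base_v2.values())
print(len(base_v2), avg_cer, avg_wer)   # 20 13.0 45.1
```

Since every language carries equal weight, languages with high error rates such as Oromo and Yoruba pull the average up regardless of how much test audio each one contributes.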

References

[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.
[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.
