STT Rw Conformer Transducer Large


This model is a fine-tuned version of nvidia/stt_rw_conformer_transducer_large. It was fine-tuned on the Mozilla Common Voice 22 and Digital Umuganda Track A datasets, which contain about 2000 hours and 500 hours of Kinyarwanda speech respectively. See the model architecture section and the NeMo documentation for complete architecture details.

NVIDIA NeMo: Training

To train, fine-tune, or play with the model you will need to install NVIDIA NeMo. We recommend you install it after installing the latest PyTorch version.

pip install "nemo_toolkit[asr]"

How to Use this Model

The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

from nemo.collections.asr.models import EncDecRNNTBPEModel
asr_model = EncDecRNNTBPEModel.from_pretrained("WakandaAI/stt_rw_conformer_transducer_large")

Transcribing using Python

Then simply do:

output = asr_model.transcribe(['sample.wav'])
print(output[0].text)

Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
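Before transcribing, it can help to verify that an audio file matches this format. A minimal sketch using Python's standard `wave` module (the `is_valid_input` helper and the synthetic `sample.wav` are illustrative, not part of the model's API):

```python
import struct
import wave

def is_valid_input(path: str) -> bool:
    """Return True if the WAV file is 16 kHz and mono."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == 16000 and wf.getnchannels() == 1

# Write 0.1 s of silence as a 16 kHz mono 16-bit PCM WAV to demonstrate the check.
with wave.open("sample.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)          # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<h", 0) * 1600)

print(is_valid_input("sample.wav"))  # True
```

Audio in other sample rates or with multiple channels should be resampled and downmixed to this format before being passed to `transcribe`.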

Output

This model provides transcribed speech as a string for a given audio sample.

Model Architecture

This is a Conformer-Transducer model: a Conformer encoder paired with an RNN-T (Transducer) decoder. See the NeMo documentation for full architecture details.

Training

Starting from the pretrained model nvidia/stt_rw_conformer_transducer_large, this model was fine-tuned on the MCV 22 and Digital Umuganda Track A datasets and evaluated on their dev and test splits.

Datasets

Mozilla Common Voice 22 (rw)
Digital Umuganda Track A

Preprocessing converted the text to lowercase and removed all punctuation except the apostrophe. Some instances in the MCV dataset used backticks or curly right quotes for the apostrophe, and both were normalized in preprocessing. This was done using

re.sub(r"[^\w\s']", "", x.strip().lower().replace("`", "'").replace("’", "'"))
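Wrapped as a self-contained function (the `normalize_text` name is illustrative), the normalization step looks like:

```python
import re

def normalize_text(x: str) -> str:
    """Lowercase, unify backtick/curly apostrophes, drop punctuation except '."""
    x = x.strip().lower().replace("`", "'").replace("\u2019", "'")
    return re.sub(r"[^\w\s']", "", x)

print(normalize_text("Muraho` neza!"))  # muraho' neza
```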

Performance

| Dataset | Split | Model | WER (%) | CER (%) |
|---|---|---|---|---|
| MCV 22 | DEV | WakandaAI/stt_rw_conformer_transducer_large | 14.24 | 4.31 |
| MCV 22 | DEV | nvidia/stt_rw_conformer_transducer_large | 14.30 | 4.47 |
| MCV 22 | TEST | WakandaAI/stt_rw_conformer_transducer_large | 16.35 | 5.29 |
| MCV 22 | TEST | nvidia/stt_rw_conformer_transducer_large | 16.71 | 5.74 |
| DU | DEV | WakandaAI/stt_rw_conformer_transducer_large | 25.03 | 4.78 |
| DU | DEV | nvidia/stt_rw_conformer_transducer_large | 29.86 | 6.59 |

MCV 22 - Mozilla Common Voice Version 22

DU - Digital Umuganda
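WER (word error rate) is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words; CER applies the same idea at the character level. A minimal sketch of the word-level metric (illustrative only, not the evaluation script used for this card):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    r, h = ref.split(), hyp.split()
    # d[j] = edit distance between the first i ref words and first j hyp words
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            substitution = prev + (r[i - 1] != h[j - 1])
            d[j] = min(substitution, d[j] + 1, d[j - 1] + 1)
            prev = cur
    return d[len(h)] / max(len(r), 1)

print(word_error_rate("a b c", "a x c"))  # 0.333... (one substitution in three words)
```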

Limitations

Since this model was trained on publicly available speech datasets, its performance may degrade on speech containing technical terms or vernacular it has not been trained on. The model may also perform worse on accented speech.

License

Use of this model is covered by the CC-BY-4.0 license. By downloading the public release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.

References

[1] NVIDIA NeMo Toolkit
