STT Rw Conformer Transducer Large
This model is a fine-tuned version of nvidia/stt_rw_conformer_transducer_large. It was fine-tuned on the Mozilla Common Voice 22 and Digital Umuganda Track A datasets, which contain about 2000 hours and 500 hours of Kinyarwanda speech respectively. See the Model Architecture section and the NeMo documentation for complete architecture details.
NVIDIA NeMo: Training
To train, fine-tune or play with the model you will need to install NVIDIA NeMo. We recommend you install it after you have installed the latest PyTorch version.
```bash
pip install nemo_toolkit['asr']
```
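A quick sanity check that the toolkit and its ASR collection import correctly (this check is a suggestion, not part of the official install instructions):

```python
# Confirm NeMo and its ASR collection are importable
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```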
How to Use this Model
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
```python
from nemo.collections.asr.models import EncDecRNNTBPEModel

asr_model = EncDecRNNTBPEModel.from_pretrained("WakandaAI/stt_rw_conformer_transducer_large")
```
Transcribing using Python
Then simply do:
```python
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
```
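transcribe() also accepts several audio files in one call; a small sketch (the file paths and batch_size value are placeholders to adjust to your setup):

```python
# Batch transcription; batch_size trades throughput for GPU memory
outputs = asr_model.transcribe(['sample1.wav', 'sample2.wav'], batch_size=4)
for hyp in outputs:
    print(hyp.text)
```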
Input
This model accepts 16 kHz, mono-channel audio (WAV files) as input.
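If your recordings are not already 16 kHz mono WAV, a minimal conversion sketch using librosa and soundfile (neither is a stated dependency of this model card, and the input path is a placeholder) is:

```python
import librosa
import soundfile as sf

# Load any common audio format, downmix to mono, and resample to 16 kHz
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16 kHz mono WAV file the model can consume
sf.write("sample.wav", audio, sr)
```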
Output
This model provides transcribed speech as a string for a given audio sample.
Model Architecture
This model uses the Conformer-Transducer architecture: a Conformer encoder combined with an RNN-T (transducer) prediction network and joint network, operating on sub-word (BPE) tokens (NeMo class EncDecRNNTBPEModel).
Training
Starting from the pretrained model nvidia/stt_rw_conformer_transducer_large, this model was fine-tuned on the MCV 22 and Digital Umuganda Track A datasets and evaluated on their dev and test splits.
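A minimal fine-tuning sketch along these lines, assuming NeMo-style JSON manifests (the manifest paths, batch sizes and trainer settings below are placeholders, and the Lightning import may differ slightly between NeMo versions):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecRNNTBPEModel

# Start from the pretrained Kinyarwanda checkpoint
asr_model = EncDecRNNTBPEModel.from_pretrained("nvidia/stt_rw_conformer_transducer_large")

# Point the model at new training/validation manifests (placeholder paths)
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
dev_cfg = OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})
asr_model.setup_training_data(train_cfg)
asr_model.setup_validation_data(dev_cfg)

# Fine-tune with PyTorch Lightning
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=10)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```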
Datasets
- Mozilla Common Voice 22 (rw)
- Digital Umuganda Track A
Preprocessing consisted of converting the text to lowercase and removing all punctuation except the apostrophe. Some instances in the MCV dataset used a backtick in place of an apostrophe, and this was accounted for in the preprocessing. This was done using:
re.sub(r"[^\w\s']", "", x.strip().lower().replace("`", "'").replace("โ", "'"))
Performance
| Dataset | Split | Model | WER (%) | CER (%) |
|---|---|---|---|---|
| MCV 22 | DEV | WakandaAI/stt_rw_conformer_transducer_large | 14.24 | 4.31 |
| MCV 22 | DEV | nvidia/stt_rw_conformer_transducer_large | 14.30 | 4.47 |
| MCV 22 | TEST | WakandaAI/stt_rw_conformer_transducer_large | 16.35 | 5.29 |
| MCV 22 | TEST | nvidia/stt_rw_conformer_transducer_large | 16.71 | 5.74 |
| DU | DEV | WakandaAI/stt_rw_conformer_transducer_large | 25.03 | 4.78 |
| DU | DEV | nvidia/stt_rw_conformer_transducer_large | 29.86 | 6.59 |
MCV 22 - Mozilla Common Voice Version 22
DU - Digital Umuganda
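One way to recompute WER and CER with NeMo's built-in helper (this is a sketch, not necessarily the exact evaluation script used for the numbers above; file paths and reference strings are placeholders):

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Hypotheses come from the model, references from the evaluation manifest (placeholders)
hypotheses = [h.text for h in asr_model.transcribe(["dev_sample1.wav", "dev_sample2.wav"])]
references = ["reference transcript one", "reference transcript two"]

wer = word_error_rate(hypotheses=hypotheses, references=references)
cer = word_error_rate(hypotheses=hypotheses, references=references, use_cer=True)
print(f"WER: {wer:.4f}  CER: {cer:.4f}")
```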
Limitations
Since this model was trained on publicly available speech datasets, its performance may degrade on speech that includes technical terms or vernacular the model has not seen during training. The model may also perform worse on accented speech.
License
Use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
References
[3] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo