STT Rw Conformer Transducer Large
This model is a fine-tuned version of nvidia/stt_rw_conformer_transducer_large. It was fine-tuned on the Mozilla Common Voice 22 and Digital Umuganda Track A datasets, which contain about 2000 hours and 500 hours of Kinyarwanda speech respectively. See the Model Architecture section and the NeMo documentation for complete architecture details.
NVIDIA NeMo: Training
To train, fine-tune or play with the model you will need to install NVIDIA NeMo. We recommend you install it after you have installed the latest PyTorch version.
```bash
pip install nemo_toolkit['asr']
```
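A quick sanity check that the toolkit and its ASR collection import correctly (this check is a suggestion, not part of the official install instructions):

```python
# Confirm NeMo and its ASR collection are importable
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```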
How to Use this Model
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
```python
from nemo.collections.asr.models import EncDecRNNTBPEModel

asr_model = EncDecRNNTBPEModel.from_pretrained("WakandaAI/stt_rw_conformer_transducer_large")
```
Transcribing using Python
Then simply do:
```python
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
```
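transcribe() also accepts several audio files in one call; a small sketch (the file paths and batch_size value are placeholders to adjust to your setup):

```python
# Batch transcription; batch_size trades throughput for GPU memory
outputs = asr_model.transcribe(['sample1.wav', 'sample2.wav'], batch_size=4)
for hyp in outputs:
    print(hyp.text)
```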
Input
This model accepts 16 kHz, mono-channel audio (WAV files) as input.
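If your recordings are not already 16 kHz mono WAV, a minimal conversion sketch using librosa and soundfile (neither is a stated dependency of this model card, and the input path is a placeholder) is:

```python
import librosa
import soundfile as sf

# Load any common audio format, downmix to mono, and resample to 16 kHz
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16 kHz mono WAV file the model can consume
sf.write("sample.wav", audio, sr)
```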
Output
This model provides transcribed speech as a string for a given audio sample.
Model Architecture
This model uses the Conformer-Transducer architecture: a Conformer encoder combined with an RNN-T (transducer) prediction network and joint network, operating on sub-word (BPE) tokens (NeMo class EncDecRNNTBPEModel).
Training
Starting from the pretrained model nvidia/stt_rw_conformer_transducer_large, this model was fine-tuned on the MCV 22 and Digital Umuganda Track A datasets and evaluated on their dev and test splits.
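A minimal fine-tuning sketch along these lines, assuming NeMo-style JSON manifests (the manifest paths, batch sizes and trainer settings below are placeholders, and the Lightning import may differ slightly between NeMo versions):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecRNNTBPEModel

# Start from the pretrained Kinyarwanda checkpoint
asr_model = EncDecRNNTBPEModel.from_pretrained("nvidia/stt_rw_conformer_transducer_large")

# Point the model at new training/validation manifests (placeholder paths)
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
dev_cfg = OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})
asr_model.setup_training_data(train_cfg)
asr_model.setup_validation_data(dev_cfg)

# Fine-tune with PyTorch Lightning
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=10)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```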
Datasets
- Mozilla Common Voice 22 (rw)
- Digital Umuganda Track A
Preprocessing consisted of converting the text to lowercase and removing all punctuation except the apostrophe. Some instances in the MCV dataset used a backtick in place of an apostrophe, and this was accounted for in the preprocessing. This was done using:
re.sub(r"[^\w\s']", "", x.strip().lower().replace("`", "'").replace("โ", "'"))
Performance
| Dataset | Split | Model | WER (%) | CER (%) |
|---|---|---|---|---|
| MCV 22 | DEV | WakandaAI/stt_rw_conformer_transducer_large | 14.24 | 4.31 |
| MCV 22 | DEV | nvidia/stt_rw_conformer_transducer_large | 14.30 | 4.47 |
| MCV 22 | TEST | WakandaAI/stt_rw_conformer_transducer_large | 16.35 | 5.29 |
| MCV 22 | TEST | nvidia/stt_rw_conformer_transducer_large | 16.71 | 5.74 |
| DU | DEV | WakandaAI/stt_rw_conformer_transducer_large | 25.03 | 4.78 |
| DU | DEV | nvidia/stt_rw_conformer_transducer_large | 29.86 | 6.59 |
MCV 22 - Mozilla Common Voice Version 22
DU - Digital Umuganda
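One way to recompute WER and CER with NeMo's built-in helper (this is a sketch, not necessarily the exact evaluation script used for the numbers above; file paths and reference strings are placeholders):

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Hypotheses come from the model, references from the evaluation manifest (placeholders)
hypotheses = [h.text for h in asr_model.transcribe(["dev_sample1.wav", "dev_sample2.wav"])]
references = ["reference transcript one", "reference transcript two"]

wer = word_error_rate(hypotheses=hypotheses, references=references)
cer = word_error_rate(hypotheses=hypotheses, references=references, use_cer=True)
print(f"WER: {wer:.4f}  CER: {cer:.4f}")
```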
Limitations
Since this model was trained on publicly available speech datasets, its performance may degrade on speech that includes technical terms or vernacular the model has not seen during training. The model may also perform worse on accented speech.
License
Use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
References
[3] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo