Soloba-TDT-600M Series
soloba-tdt-0.6b-v0.5 is a fine-tuned version of nvidia/parakeet-tdt-0.6b-v2 on the African Next Voices (ANV) dataset. This model does not consistently produce capitalization and punctuation, and its transcriptions cannot include acoustic event tags like those found in the ANV dataset. It was fine-tuned using NVIDIA NeMo.
This model is the only one of the v0 series that was not trained on RobotsMali/bam-asr-early or any derivative of Jeli-ASR, hence its particular naming (v0.5).
🚨 Important Note
This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:
- The model may not generalize well across all speaking conditions and dialects.
- Community feedback is welcome, and contributions are encouraged to refine the model further.
NVIDIA NeMo: Training
To fine-tune or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest version of PyTorch.
pip install "nemo-toolkit[asr]"
How to Use This Model
Note that this model has been released primarily for research purposes.
Load Model with NeMo
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-tdt-0.6b-v0.5")
Transcribe Audio
asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
Input
This model accepts mono-channel audio (WAV files) as input and resamples it to a 16 kHz sample rate before performing the forward pass.
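Before transcribing, it can be useful to confirm that a file is mono and to see its sample rate. The following is a minimal sketch using only the Python standard library; `describe_wav` and `is_mono` are hypothetical helpers (not part of NeMo), and they only handle PCM WAV files that the `wave` module can read.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / float(sample_rate)
    return channels, sample_rate, duration


def is_mono(path):
    """True if the file has a single channel, as this model expects."""
    channels, _, _ = describe_wav(path)
    return channels == 1
```

Files that fail the `is_mono` check would need to be downmixed with an audio tool of your choice before being passed to the model.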
Output
With nemo>=2.3, this model returns each transcription as a hypothesis object whose text attribute contains the transcription string for the given speech sample.
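To work with plain strings downstream, the `text` attribute can be pulled out of each result. This is a small sketch, not part of NeMo: `hypotheses_to_text` is a hypothetical helper, and its handling of plain strings is an assumption added in case a different NeMo version returns strings directly.

```python
def hypotheses_to_text(results):
    """Convert a list of transcription results into plain strings.

    Each item may be an object exposing a `text` attribute (as described
    above for nemo>=2.3) or, by assumption, already a plain string.
    """
    return [item if isinstance(item, str) else item.text for item in results]
```

For example, `hypotheses_to_text(asr_model.transcribe(['sample_audio.wav']))` would give a list with one string per input file.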
Model Architecture
This model uses a FastConformer encoder with a TDT (Token-and-Duration Transducer) decoder, the same architecture as its base model nvidia/parakeet-tdt-0.6b-v2. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: Fast-Conformer Model.
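To give a feel for what 8x downsampling means in practice, the sketch below estimates how many encoder frames an utterance produces. The 10 ms feature hop is an assumption (it is not stated in this card), and the count is approximate because convolution padding can shift it slightly; `encoder_frames` is a hypothetical helper.

```python
def encoder_frames(duration_s, feature_hop_s=0.01, downsampling=8):
    """Approximate number of encoder output frames for an utterance.

    feature_hop_s is an ASSUMED 10 ms feature hop (not stated in this card);
    downsampling is the 8x factor described above.
    """
    feature_frames = int(round(duration_s / feature_hop_s))
    # Ceiling division: a partial group of frames still yields one output frame.
    return -(-feature_frames // downsampling)
```

Under these assumptions, each encoder frame covers roughly 80 ms of audio, so a 10-second clip maps to about 125 encoder frames.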
Training
The NeMo toolkit was used to fine-tune this model for 82,628 steps, starting from the nvidia/parakeet-tdt-0.6b-v2 model. The fine-tuning code and configurations can be found at RobotsMali-AI/bambara-asr.
The tokenizer for this model was trained on the text transcripts of the train set of RobotsMali/afvoices using this script.
Dataset
This model was fine-tuned on a 100-hour pre-completion subset of the African Next Voices dataset. You can reconstitute that subset with these manifest files.
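NeMo manifests are typically JSON-lines files with one utterance per line, carrying fields such as `audio_filepath`, `duration` (in seconds), and `text`. Assuming the manifest files for this subset follow that layout, a minimal sketch for loading one and checking its total duration could look like this; `read_manifest` and `total_hours` are hypothetical helper names.

```python
import json


def read_manifest(path):
    """Load a JSON-lines manifest into a list of dicts, skipping blank lines."""
    entries = []
    with open(path, "r", encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_hours(entries):
    """Sum the `duration` field (assumed to be in seconds) and convert to hours."""
    return sum(float(entry["duration"]) for entry in entries) / 3600.0
```

Running `total_hours(read_manifest(...))` on a manifest is a quick sanity check that a reconstituted subset has the expected size.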
Performance
We report the Word Error Rate (WER) and Character Error Rate (CER) for this model:
| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ |
|---|---|---|---|
| African Next Voices (afvoices) | TDT | 29.75 | 13.50 |
| Nyana Eval | TDT | 42.43 | 23.34 |
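For readers who want to compute comparable metrics on their own data, the sketch below shows how WER and CER are commonly derived from the Levenshtein edit distance. It is an illustration in plain Python, not the evaluation script used to produce the numbers above; `edit_distance` and `error_rate` are hypothetical helpers, and no text normalization is applied.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level WER (unit='word') or CER (unit='char') as a percentage."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(references, hypotheses))
    total = sum(len(split(r)) for r in references)
    return 100.0 * errors / total
```

The rate is computed over the whole corpus (total errors divided by total reference length) rather than averaged per utterance, so long utterances weigh more than short ones. The reference set must be non-empty, since the total length is used as the divisor.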
License
This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.
Feel free to open a discussion on Hugging Face or file an issue on GitHub for help or contributions.