Soloba-TDT-600M Series


soloba-tdt-0.6b-v0.5 is a fine-tuned version of nvidia/parakeet-tdt-0.6b-v2 on the African Next Voices (ANV) dataset. The model does not consistently produce capitalization and punctuation, and its transcriptions cannot include acoustic event tags like those found in the ANV dataset. It was fine-tuned using NVIDIA NeMo.

This is the only model in the v0 series that was not trained on RobotsMali/bam-asr-early or any derivative of Jeli-ASR, hence its distinct version number (v0.5).

🚨 Important Note

This model and its associated resources are part of an ongoing research effort; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:

The model may not generalize well across all speaking conditions and dialects.
Community feedback is welcome, and contributions are encouraged to refine the model further.

NVIDIA NeMo: Training

To fine-tune or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest version of PyTorch.

pip install nemo-toolkit['asr']

How to Use This Model

Note that this model has been released primarily for research purposes.

Load Model with NeMo

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-tdt-0.6b-v0.5")

Transcribe Audio

# Switch the model loaded above to inference mode
asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])

Input

This model accepts mono-channel audio (WAV files) as input and resamples it to a 16 kHz sample rate before performing the forward pass.
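Before passing a file to the model, it can help to confirm that it really is mono. The following is a minimal sketch using only Python's standard wave module; the helper names describe_wav and is_mono are hypothetical, and the code only handles uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def is_mono(path):
    """True if the file has exactly one channel, as the model expects."""
    return describe_wav(path)[0] == 1
```

A file that fails is_mono would need to be downmixed with an audio tool of your choice before transcription.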

Output

This model returns each transcription as a Hypothesis object whose text attribute contains the transcription string for the given speech sample (nemo>=2.3).
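Because the return type is tied to the installed NeMo version, a small adapter can keep downstream code indifferent to it. This is a sketch and not part of NeMo: to_text is a hypothetical helper, and the assumption that some older releases return plain strings should be checked against your own installation.

```python
def to_text(result):
    """Return the transcription string from one item of a transcribe() result."""
    # Assumption: some older NeMo releases return plain strings.
    if isinstance(result, str):
        return result
    # nemo>=2.3 (per this model card): an object exposing a .text attribute.
    return result.text
```

With the model loaded as shown earlier, this could be applied as [to_text(r) for r in asr_model.transcribe(['sample_audio.wav'])].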

Model Architecture

This model uses a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder, the same architecture as its base model, nvidia/parakeet-tdt-0.6b-v2. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: Fast-Conformer Model.
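To give a feel for what 8x downsampling means in practice, the sketch below estimates how many encoder frames a clip produces. It assumes a 10 ms feature hop, which is a common default but is not stated in this card, and it ignores padding details, so treat the result as approximate. The function name approx_encoder_frames is hypothetical.

```python
def approx_encoder_frames(duration_ms, hop_ms=10, subsampling=8):
    """Approximate encoder output length for a clip of the given duration.

    hop_ms=10 is an assumption (not taken from this model card);
    subsampling=8 is the 8x downsampling described above.
    """
    # Ceiling division: number of feature frames before downsampling.
    feature_frames = -(-duration_ms // hop_ms)
    # Ceiling division again: number of frames after 8x downsampling.
    return -(-feature_frames // subsampling)
```

Under these assumptions, a 10-second clip yields about 1000 feature frames and 125 encoder frames, that is, one encoder frame per 80 ms of audio.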

Training

This model was fine-tuned with the NeMo toolkit for 82,628 steps, starting from the nvidia/parakeet-tdt-0.6b-v2 model. The fine-tuning code and configurations can be found at RobotsMali-AI/bambara-asr.

The tokenizer for this model was trained on the text transcripts of the train set of RobotsMali/afvoices using this script.

Dataset

This model was fine-tuned on a 100-hour pre-completion subset of the African Next Voices dataset. You can reconstitute that subset with these manifest files.
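After rebuilding the subset, one way to sanity-check it is to total the audio duration listed in a manifest. The sketch below assumes the manifests follow the usual NeMo JSON-lines layout, with one JSON object per line and a duration field in seconds; the linked files were not inspected here, and manifest_hours is a hypothetical helper name.

```python
import json


def manifest_hours(manifest_path):
    """Sum the 'duration' field (seconds) of a JSON-lines manifest, in hours."""
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            total_seconds += float(json.loads(line)["duration"])
    return total_seconds / 3600.0
```

If the assumption holds, the training manifests together should come out close to the 100 hours stated above.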

Performance

We report the Word Error Rate (WER) and Character Error Rate (CER) for this model:

Benchmark	Decoding	WER (%) ↓	CER (%) ↓
African Next Voices (afvoices)	TDT	29.75	13.50
Nyana Eval	TDT	42.43	23.34
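For readers unfamiliar with these metrics, the following is a minimal sketch of how WER and CER are defined for a single utterance, using a plain Levenshtein distance. It is not the scoring code used for the numbers above: any text normalization and the way errors are aggregated over a test set are not described in this card, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; the reference must not be empty."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; the reference must not be empty."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

Both metrics divide the number of edits by the reference length, so they can exceed 100% when the hypothesis is much longer than the reference.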

License

This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.

Feel free to open a discussion on Hugging Face or file an issue on GitHub for help or contributions.
