Push model using huggingface_hub.
- .gitattributes +1 -0
- README.md +154 -0
- soloni-114m-tdt-ctc-v1.nemo +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+soloni-114m-tdt-ctc-v1.nemo filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
|
@@ -0,0 +1,154 @@
---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/kunkado

thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: RobotsMali/soloni-114m-tdt-ctc-V0
model-index:
- name: soloni-114m-tdt-ctc-v1
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: kunkado (human-reviewed)
      type: RobotsMali/kunkado
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER (TDT)
      type: wer
      value: 42.862898111343384
    - name: Test WER (CTC)
      type: wer
      value: 39.15117383003235

metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Soloni TDT-CTC 114M Bambara

<style>
img {
 display: inline;
}
</style>

[](#model-architecture)
| [](#model-architecture)
| [](#datasets)

`soloni-114m-tdt-ctc-v1` is a fine-tuned version of [`RobotsMali/soloni-114m-tdt-ctc-V0`](https://huggingface.co/RobotsMali/soloni-114m-tdt-ctc-V0) on [RobotsMali/kunkado](https://huggingface.co/datasets/RobotsMali/kunkado). The model does not produce punctuation or capitalization, since both were absent from its training data. It was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.

The model does not tag code-switched expressions in its transcriptions: for training, we treated them as part of a modern variant of the Bambara language and removed all tags and markers.

## **🚨 Important Note**

This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:

- **The model may not generalize well across all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**

## NVIDIA NeMo: Training

To fine-tune or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest version of PyTorch.

```bash
pip install nemo_toolkit['asr']
```

## How to Use This Model

Note that this model has been released primarily for research purposes.

### Load Model with NeMo

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-114m-tdt-ctc-v1")
```

### Transcribe Audio

```python
asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```

### Input

This model accepts any **mono-channel audio (WAV files)** as input and resamples it to a *16 kHz sample rate* before performing the forward pass.

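As a quick sanity check before transcription, you can inspect a WAV file's channel count and sample rate with the standard library. This is a minimal sketch (`check_wav_format` is a hypothetical helper, not part of NeMo); the model resamples non-16 kHz input itself, so this is informational only:

```python
import wave

def check_wav_format(path: str, target_rate: int = 16000) -> bool:
    """Return True if the WAV file is already mono at the target rate.

    The model resamples internally, so this check is informational only.
    """
    with wave.open(path, "rb") as wf:
        return wf.getnchannels() == 1 and wf.getframerate() == target_rate
```
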
### Output

The model returns the transcription of a given speech sample as a string, wrapped in a Hypothesis object (under nemo>=2.3).

## Model Architecture

This model uses a Hybrid FastConformer-TDT-CTC architecture. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

It also features two independent decoders:

- A recurrent neural network transducer that jointly predicts tokens and their durations, the so-called [Token-and-Duration](https://arxiv.org/abs/2304.06795) Transducer by NVIDIA
- A classical convolutional neural network trained with the CTC loss, the ***Connectionist Temporal Classification*** decoder

## Training

The NeMo toolkit (version 2.3.0) was used to fine-tune this model for **100,551 steps** starting from the `RobotsMali/soloni-114m-tdt-ctc-V0` checkpoint. This version was trained with this [base config](https://github.com/diarray-hub/bambara-asr/blob/main/kunkado-training/config/soloni/soloni-v1.4.0.yaml). The full training configurations, scripts, and experimental logs are available here:

🔗 [Bambara-ASR Experiments](https://github.com/diarray-hub/bambara-asr)

The tokenizers for these models were built from the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

## Dataset

This model was fine-tuned on the human-reviewed subset of the [kunkado](https://huggingface.co/datasets/RobotsMali/kunkado) dataset, which consists of **~40 hours of transcribed Bambara speech**. The text was normalized with the [bambara-normalizer](https://pypi.org/project/bambara-normalizer/) prior to training: normalizing numbers, removing punctuation, removing tags, and converting to lowercase.

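The normalization steps described above can be illustrated with a plain-Python sketch. This is an illustration only, not the bambara-normalizer API (the actual preprocessing used that package, and number normalization is omitted here):

```python
import re
import string

def normalize_illustration(text: str) -> str:
    """Rough illustration of the normalization steps (minus number
    normalization): drop tag markup, strip punctuation, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove tag markup
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())  # lowercase, collapse whitespace
```
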
## Performance

The performance of automatic speech recognition models is measured using Word Error Rate (WER). Since this model has two decoders operating independently, each decoder is evaluated independently as well.

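WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal pure-Python sketch for illustration (evaluation toolkits compute this for you):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the words seen so far
    # in the reference and the first j words of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)
```
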
The following table summarizes the performance of the available models in this collection with both decoders. Performance is reported in terms of **Word Error Rate (WER%)**.

|**Decoder (Version)**|**Tokenizer**|**Vocabulary Size**|**bam-asr-all**|**kunkado**|
|---------|-----------------------|-----------------|---------|---------|
| CTC (v0) | BPE | 1024 | 40.6 | - |
| TDT (v0) | BPE | 1024 | 66.7 | - |
| CTC (v1) | BPE | 512 | - | 39.15 |
| TDT (v1) | BPE | 512 | - | 42.86 |

These are greedy WER numbers without an external LM. By default, the main decoder branch is the TDT branch; if you would like to switch to the CTC decoder, simply run this block of code before calling the `.transcribe` method:

```python
# Retrieve the CTC decoding config
ctc_decoding_cfg = asr_model.cfg.aux_ctc.decoding
# Then change the decoding strategy
asr_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)
# Transcribe with the CTC decoder
asr_model.transcribe(['sample_audio.wav'])
```

## License

This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/diarray-hub/bambara-asr/issues) on GitHub if you have any contributions or feedback.

---

soloni-114m-tdt-ctc-v1.nemo
ADDED
|
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dfeb47ed17160893408e2bd6157d1967fbee3d23748cde34614771181b8a53f6
+size 455526400