Commit · 4a90298
Parent(s): 04736a9
Audio transcription examples added and model names changed
- .gitattributes +4 -0
- README.md +76 -9
- audio_samples/example1.wav +3 -0
- audio_samples/example2.wav +3 -0
- audio_samples/example3.wav +3 -0
- images/cer.png +0 -0
- images/wer.png +0 -0
.gitattributes
CHANGED
|
@@ -37,3 +37,7 @@ unigrams.txt filter=lfs diff=lfs merge=lfs -text
|
|
| 37 |
language_model/3gram.bin filter=lfs diff=lfs merge=lfs -text
|
| 38 |
language_model/attrs.json filter=lfs diff=lfs merge=lfs -text
|
| 39 |
language_model/unigrams.txt filter=lfs diff=lfs merge=lfs -text
|
| 37 |
language_model/3gram.bin filter=lfs diff=lfs merge=lfs -text
|
| 38 |
language_model/attrs.json filter=lfs diff=lfs merge=lfs -text
|
| 39 |
language_model/unigrams.txt filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
audio_samples/example1.wav filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
audio_samples/example2.wav filter=lfs diff=lfs merge=lfs -text
|
| 42 |
+
audio_samples/example3.wav filter=lfs diff=lfs merge=lfs -text
|
| 43 |
+
audio_samples/example4.wav filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -53,13 +53,80 @@ Next you can use the model using the `transformers` Python package as follows:
|
|
| 53 |
{'text': 'your transcription'}
|
| 54 |
```
|
| 55 |
|
| 56 |
## Model Details
|
| 57 |
|
| 58 |
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) dataset to enhance its performance in recognizing Danish speech with consideration to different dialects. The model was trained for 30K steps using the training setup in the [CoRaL repository](https://github.com/alexandrainst/coral/tree) by running:
|
| 59 |
```
|
| 60 |
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
|
| 61 |
```
|
| 62 |
-
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
|
| 63 |
## Dataset
|
| 64 |
|
| 65 |
### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
|
|
@@ -84,8 +151,8 @@ The model was evaluated using the following metrics:
|
|
| 84 |
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
| 85 |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
|
| 86 |
| [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 1540M | Read-aloud and conversation | 5.3% ± 0.2% | 12.0% ± 0.4% |
|
| 87 |
-
| [Alvenir/
|
| 88 |
-
| [alexandrainst/roest-
|
| 89 |
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
|
| 90 |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
|
| 91 |
|
|
@@ -97,7 +164,7 @@ The model was evaluated using the following metrics:
|
|
| 97 |
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
|
| 98 |
|
| 99 |
### Table CER scores in % of evaluation across demographics on the CoRal test data
|
| 100 |
-
| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 |
|
| 101 |
|:---:|:---:|:---:|:---:|:---:|
|
| 102 |
| female | 7.2 | 7.4 | 6.9 | 5.1 |
|
| 103 |
| male | 5.7 | 5.8 | 3.7 | 3.6 |
|
|
@@ -117,7 +184,7 @@ The model was evaluated using the following metrics:
|
|
| 117 |
| Overall | 6.5 | 6.6 | 5.3 | 4.3 |
|
| 118 |
|
| 119 |
### Table WER scores in % of evaluation across demographics on the CoRal test data
|
| 120 |
-
| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 |
|
| 121 |
|:---:|:---:|:---:|:---:|:---:|
|
| 122 |
| female | 17.7 | 18.5 | 14.2 | 11.5 |
|
| 123 |
| male | 14.9 | 15.5 | 9.9 | 9.4 |
|
|
@@ -138,19 +205,19 @@ The model was evaluated using the following metrics:
|
|
| 138 |
|
| 139 |
|
| 140 |
### Roest-wav2vec2-315M with and without language model
|
| 141 |
-
The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
|
| 142 |
|
| 143 |
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
| 144 |
| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
| 145 |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
| 146 |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
|
| 147 |
-
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
| 148 |
-
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
|
| 149 |
|
| 150 |
### Detailed Roest-wav2vec2-315M with and without language model on different dialects
|
| 151 |
Here are the results of the model on different danish dialects in the test set:
|
| 152 |
|
| 153 |
-
| | Roest-
|
| 154 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
|
| 155 |
| LM | No | | Yes | | No | | Yes | |
|
| 156 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
|
|
|
|
| 53 |
{'text': 'your transcription'}
|
| 54 |
```
|
| 55 |
|
| 56 |
+
## Transcription examples
|
| 57 |
+
|
| 58 |
+
### Example 1
|
| 59 |
+
<audio controls>
|
| 60 |
+
<source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example1.wav" type="audio/wav">
|
| 61 |
+
Your browser does not support the audio tag.
|
| 62 |
+
</audio>
|
| 63 |
+
|
| 64 |
+
**Dialect:** Vestjysk
|
| 65 |
+
|
| 66 |
+
**Transcription:** det blev til yderlig ti mål i den første sæson på trods af en position som back
|
| 67 |
+
|
| 68 |
+
**Target transcription:** det blev til yderligere ti mål i den første sæson på trods af en position som back
|
| 69 |
+
|
| 70 |
+
**CER:** 3.7%
|
| 71 |
+
|
| 72 |
+
**WER:** 5.9%
|
| 73 |
+
|
| 74 |
+
### Example 2
|
| 75 |
+
<audio controls>
|
| 76 |
+
<source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example2.wav" type="audio/wav">
|
| 77 |
+
Your browser does not support the audio tag.
|
| 78 |
+
</audio>
|
| 79 |
+
|
| 80 |
+
**Dialect:** Sønderjysk
|
| 81 |
+
|
| 82 |
+
**Transcription:** en arkitektoniske udformning af pladser forslagene iver benzen
|
| 83 |
+
|
| 84 |
+
**Target transcription:** den arkitektoniske udformning af pladsen er forestået af ivar bentsen
|
| 85 |
+
|
| 86 |
+
**CER:** 20.3%
|
| 87 |
+
|
| 88 |
+
**WER:** 60.0%
|
| 89 |
+
|
| 90 |
+
### Example 3
|
| 91 |
+
<audio controls>
|
| 92 |
+
<source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example3.wav" type="audio/wav">
|
| 93 |
+
Your browser does not support the audio tag.
|
| 94 |
+
</audio>
|
| 95 |
+
|
| 96 |
+
**Dialect:** Nordsjællandsk
|
| 97 |
+
|
| 98 |
+
**Transcription:** østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission
|
| 99 |
+
|
| 100 |
+
**Target transcription:** østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission
|
| 101 |
+
|
| 102 |
+
**CER:** 0.0%
|
| 103 |
+
|
| 104 |
+
**WER:** 0.0%
|
| 105 |
+
|
| 106 |
+
### Example 4
|
| 107 |
+
<audio controls>
|
| 108 |
+
<source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example4.wav" type="audio/wav">
|
| 109 |
+
Your browser does not support the audio tag.
|
| 110 |
+
</audio>
|
| 111 |
+
|
| 112 |
+
**Dialect:** Lollandsk
|
| 113 |
+
|
| 114 |
+
**Transcription:** det er produceret af thomas helme og indspillede i easy sound recording studio i københavn
|
| 115 |
+
|
| 116 |
+
**Target transcription:** det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn
|
| 117 |
+
|
| 118 |
+
**CER:** 4.4%
|
| 119 |
+
|
| 120 |
+
**WER:** 13.3%
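
The CER and WER figures quoted for the examples above can be reproduced approximately with a plain Levenshtein (edit) distance. The sketch below is only an illustration: the function names `edit_distance`, `wer` and `cer` are hypothetical, and it assumes whitespace tokenization with no extra text normalization, which may differ from what the CoRal evaluation code actually does.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Example 1 from the README (assumed to be scored without normalization).
target = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
output = "det blev til yderlig ti mål i den første sæson på trods af en position som back"

print(f"CER: {cer(target, output):.1%}")
print(f"WER: {wer(target, output):.1%}")
```

Under these assumptions, Example 1 has one wrong word out of 17 and three missing characters out of 82, which rounds to the WER and CER listed for it. Other examples may not match exactly if the original evaluation normalizes text differently.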
|
| 121 |
+
|
| 122 |
+
|
| 123 |
## Model Details
|
| 124 |
|
| 125 |
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) dataset to enhance its performance in recognizing Danish speech with consideration to different dialects. The model was trained for 30K steps using the training setup in the [CoRaL repository](https://github.com/alexandrainst/coral/tree) by running:
|
| 126 |
```
|
| 127 |
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
|
| 128 |
```
|
| 129 |
+
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
|
| 130 |
## Dataset
|
| 131 |
|
| 132 |
### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
|
|
|
|
| 151 |
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
| 152 |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
|
| 153 |
| [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 1540M | Read-aloud and conversation | 5.3% ± 0.2% | 12.0% ± 0.4% |
|
| 154 |
+
| [Alvenir/roest-whisper-large-v1](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
|
| 155 |
+
| [alexandrainst/roest-wav2vec2-315M-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
| 156 |
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
|
| 157 |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
|
| 158 |
|
|
|
|
| 164 |
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
|
| 165 |
|
| 166 |
### Table CER scores in % of evaluation across demographics on the CoRal test data
|
| 167 |
+
| Category | roest-wav2vec2-315m-v2 | roest-wav2vec2-315m-v1 | roest-whisper-large-v2 | roest-whisper-large-v1 |
|
| 168 |
|:---:|:---:|:---:|:---:|:---:|
|
| 169 |
| female | 7.2 | 7.4 | 6.9 | 5.1 |
|
| 170 |
| male | 5.7 | 5.8 | 3.7 | 3.6 |
|
|
|
|
| 184 |
| Overall | 6.5 | 6.6 | 5.3 | 4.3 |
|
| 185 |
|
| 186 |
### Table WER scores in % of evaluation across demographics on the CoRal test data
|
| 187 |
+
| Category | roest-wav2vec2-315m-v2 | roest-wav2vec2-315m-v1 | roest-whisper-large-v2 | roest-whisper-large-v1 |
|
| 188 |
|:---:|:---:|:---:|:---:|:---:|
|
| 189 |
| female | 17.7 | 18.5 | 14.2 | 11.5 |
|
| 190 |
| male | 14.9 | 15.5 | 9.9 | 9.4 |
|
|
|
|
| 205 |
|
| 206 |
|
| 207 |
### Roest-wav2vec2-315M with and without language model
|
| 208 |
+
The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
|
| 209 |
|
| 210 |
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
| 211 |
| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
| 212 |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
| 213 |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
|
| 214 |
+
| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
| 215 |
+
| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
|
| 216 |
|
| 217 |
### Detailed Roest-wav2vec2-315M with and without language model on different dialects
|
| 218 |
Here are the results of the model on different danish dialects in the test set:
|
| 219 |
|
| 220 |
+
| | Roest-v1 | | Roest-v1 | | Roest-v2 | | Roest-v2 | |
|
| 221 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
|
| 222 |
| LM | No | | Yes | | No | | Yes | |
|
| 223 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
|
audio_samples/example1.wav
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:97be8c695d4c6debdd4096cea9400468992ecf27743f313ff8e988271c9b6aae
|
| 3 |
+
size 529978
|
audio_samples/example2.wav
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e97d6c5d5999f8f2c6eed1f2847f4dae0006e7025148a17503b3f836c5f4a57a
|
| 3 |
+
size 249658
|
audio_samples/example3.wav
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5e8c8870082c39d13d1f2800cefb971bd9d56667d1e4437d05feee8e3900e18a
|
| 3 |
+
size 361018
|
images/cer.png
CHANGED
|
|
images/wer.png
CHANGED
|
|