---
tags:
- speech-to-text
- vietnamese
- ai-model
- deep-learning
license: apache-2.0
library_name: pytorch
model_name: EfficientConformerVietnamese
language: vi
---
# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition [Paper](https://arxiv.org/abs/2109.01163)
## Efficient Conformer Encoder
Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
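To make the grouping idea concrete, here is a minimal pure-Python sketch. It is not the repository's implementation: the helper names `group_frames` and `attention_score_cost` are hypothetical, and the sketch assumes the sequence length is already a multiple of the group size and ignores the per-head split and the un-grouping applied after attention. It only shows how concatenating `g` neighbouring frames shortens the sequence and why that lowers the cost of forming the attention score matrix.

```python
def group_frames(frames, group_size):
    """Concatenate each run of `group_size` neighbouring frames into one wider frame."""
    if len(frames) % group_size != 0:
        raise ValueError("sequence length must be divisible by group_size (pad first)")
    grouped = []
    for start in range(0, len(frames), group_size):
        merged = []
        for frame in frames[start:start + group_size]:
            merged.extend(frame)  # stack neighbours along the feature dimension
        grouped.append(merged)
    return grouped


def attention_score_cost(seq_len, feat_dim, group_size=1):
    """Multiply-adds needed to form the (n/g) x (n/g) score matrix Q @ K^T."""
    grouped_len = seq_len // group_size
    grouped_dim = feat_dim * group_size
    return grouped_len * grouped_len * grouped_dim


# Toy sequence: n = 6 frames with d = 2 features each.
frames = [[float(t), float(t) + 0.5] for t in range(6)]
grouped = group_frames(frames, group_size=3)
print(len(grouped), len(grouped[0]))  # shorter sequence, wider frames

# Illustrative sizes only (not taken from the model configuration).
baseline = attention_score_cost(1200, 120)
with_groups = attention_score_cost(1200, 120, group_size=3)
print(baseline / with_groups)
```

With group size `g`, the score matrix shrinks from `n x n` to `(n/g) x (n/g)` while each dot product grows from `d` to `d * g` terms, so the cost drops from roughly `n² · d` to `n² · d / g`.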
## Installation
Clone the GitHub repository and set up the environment:
```
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```
Install [ctcdecode](https://github.com/parlance/ctcdecode)
## Prepare dataset and training pipeline
Datasets used to train this mini version:
- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs

Steps:
- Prepare a dataset folder that contains one subfolder per data domain you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each domain folder (e.g. VLSP2020), there should be corresponding .wav and .txt files.
- Add noise to the audio using **add_noise.py**.
- Change the speaking speed using **speed_permutation.py**.
- Extract audio length and BPE tokens using **prepare_dataset.py**.
- Filter audio by the maximum length specified, using **filter_max_length.py**, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using **train.py** (please read the parameters carefully).
- Prepare an **lm_corpus.txt** file and train an **n-gram BPE language model** using **train_lm.py**.
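
Before running the steps above, it can help to check that the dataset folder is laid out as the first step describes. The sketch below is not part of the repository: `find_unpaired_wavs` is a hypothetical helper, and it assumes that each transcript shares its file name with the audio file (`utt1.wav` next to `utt1.txt`), which the steps above imply but do not state explicitly.

```python
from pathlib import Path


def find_unpaired_wavs(dataset_root):
    """Return the .wav files under dataset_root that lack a same-named .txt transcript."""
    missing = []
    # rglob walks every domain subfolder (e.g. VLSP2020, VLSP2021) recursively.
    for wav_path in sorted(Path(dataset_root).rglob("*.wav")):
        if not wav_path.with_suffix(".txt").is_file():
            missing.append(wav_path)
    return missing
```

For example, `find_unpaired_wavs("ASRDataset")` returns a list of audio paths with no transcript; an empty list means every .wav file has a matching .txt file.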
## Evaluation
Please read **test.py** carefully before running the evaluation.
```
bash test.sh
```
## Monitor training
```
tensorboard --logdir callback_path
```
## Vietnamese Performance
| Model | Gigaspeech_test (Greedy / n-gram Beam Search) | VLSP2023_pb_test (Greedy / n-gram Beam Search) | VLSP2023_pr_test (Greedy / n-gram Beam Search) |
|:--------------------------------------|:------------------------------------------------:|:-------------------------------------------------:|:-------------------------------------------------:|
| **EC-Small-CTC** | **19.61 / 17.47** | **23.06 / 20.83** | **23.17 / 21.15** |
| **PhoWhisper-Tiny** | **20.45** | **33.21** | **33.02** |
| **PhoWhisper-Base** | **18.78** | **29.25** | **28.29** |

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report below:
https://www.overleaf.com/read/nhqjtcpktjyc#3b472e
## Reference
[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)
* Maxime Burchi [@burchim](https://github.com/burchim)