---
tags:
- speech-to-text
- vietnamese
- ai-model
- deep-learning
license: apache-2.0
library_name: pytorch
model_name: EfficientConformerVietnamese
language: vi
---

# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

[Paper](https://arxiv.org/abs/2109.01163)

## Efficient Conformer Encoder

Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks using grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
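To make the grouping idea concrete, here is a minimal single-head sketch in PyTorch. It is illustrative only, not the repository's implementation: `grouped_attention` and `group_size` are assumed names, and the real encoder uses multi-head attention with relative positional encodings.

```python
import torch
import torch.nn.functional as F

def grouped_attention(q, k, v, group_size):
    """Single-head grouped scaled dot-product attention (illustrative).

    q, k, v: tensors of shape (batch, time, dim). Neighbouring frames
    are folded into the feature dimension, so attention runs over a
    sequence that is `group_size` times shorter.
    """
    b, t, d = q.shape
    # Zero-pad the time axis so it divides evenly into groups.
    pad = (group_size - t % group_size) % group_size
    q, k, v = (F.pad(x, (0, 0, 0, pad)) for x in (q, k, v))
    tg = (t + pad) // group_size
    # (b, t, d) -> (b, t/g, d*g): fold neighbouring frames into features.
    q, k, v = (x.reshape(b, tg, d * group_size) for x in (q, k, v))
    scores = q @ k.transpose(1, 2) / (d * group_size) ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    # Unfold back to per-frame resolution and drop the padding.
    return out.reshape(b, tg * group_size, d)[:, :t]

x = torch.randn(2, 100, 64)
y = grouped_attention(x, x, x, group_size=4)  # y: (2, 100, 64)
```

Because attention is applied to a sequence `group_size` times shorter, the quadratic attention cost drops accordingly.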
## Installation

Clone the GitHub repository and set up the environment:

```
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```

Install [ctcdecode](https://github.com/parlance/ctcdecode).

## Prepare dataset and training pipeline

Datasets used to train this mini version:

- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs

Steps:

- Prepare a dataset folder that contains the data domains you want to train on, for example ASRDataset/VLSP2020 and ASRDataset/VLSP2021. Each domain folder should contain matching .wav and .txt files.
- Add noise to the audio using **add_noise.py**.
- Change the speaking speed using **speed_permutation.py**.
- Extract audio lengths and BPE tokens using **prepare_dataset.py**.
- Filter audio by the specified maximum length using **filter_max_length.py**, and save the list of audio files used for training in a .txt file, for example data/train_wav_names.txt (a minimal sketch of this step follows the list).
- Train the model using **train.py** (please read the parameters carefully).
- Prepare an **lm_corpus.txt** and train the **n-gram BPE language model** using **train_lm.py**.
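For orientation, here is a minimal sketch of what the length-filtering step produces, assuming .wav files under ASRDataset/ and one file name per line in the output list. The actual format expected by **train.py** is defined by **filter_max_length.py**, so treat the threshold and paths below as placeholders:

```python
import torchaudio
from pathlib import Path

MAX_SECONDS = 16.0  # placeholder threshold; use the value you pass to filter_max_length.py

kept = []
for wav in sorted(Path("ASRDataset").rglob("*.wav")):
    info = torchaudio.info(str(wav))  # reads the header only, no full decode
    duration = info.num_frames / info.sample_rate
    if duration <= MAX_SECONDS:
        kept.append(wav.name)

# One audio file name per line, e.g. data/train_wav_names.txt (assumed format).
Path("data/train_wav_names.txt").write_text("\n".join(kept))
```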
## Evaluation

Please read the code in test.py carefully!

```
bash test.sh
```
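test.sh drives the evaluation end to end. For reference, the n-gram beam search results below rely on [ctcdecode](https://github.com/parlance/ctcdecode); a minimal sketch of its use follows, where the vocabulary, LM path, and alpha/beta weights are placeholders rather than the repository's settings:

```python
import torch
from ctcdecode import CTCBeamDecoder

# Placeholder vocabulary: index 0 is the CTC blank; the real labels come
# from the repository's BPE tokenizer.
labels = ["_"] + [f"tok{i}" for i in range(255)]

decoder = CTCBeamDecoder(
    labels,
    model_path="path/to/ngram_bpe_lm.arpa",  # placeholder: LM built by train_lm.py
    alpha=0.5,                               # placeholder LM weight
    beta=1.0,                                # placeholder insertion bonus
    beam_width=64,
    blank_id=0,
    log_probs_input=True,
)

# (batch, time, vocab) acoustic log-probabilities from the CTC model.
log_probs = torch.randn(1, 200, len(labels)).log_softmax(dim=-1)
beam_results, beam_scores, timesteps, out_lens = decoder.decode(log_probs)
best = beam_results[0][0][: out_lens[0][0]]  # token ids of the top hypothesis
```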
## Monitor training

```
tensorboard --logdir callback_path
```

## Vietnamese Performance

Word error rate (WER, %), reported as Greedy / n-gram Beam Search where both decoders were evaluated; lower is better.

| Model | Gigaspeech_test | VLSP2023_pb_test | VLSP2023_pr_test |
|:--------------------|:-----------------:|:-----------------:|:-----------------:|
| **EC-Small-CTC** | **19.61 / 17.47** | **23.06 / 20.83** | **23.17 / 21.15** |
| **PhoWhisper-Tiny** | **20.45** | **33.21** | **33.02** |
| **PhoWhisper-Base** | **18.78** | **29.25** | **28.29** |

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report: https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

## Reference

[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)

* Maxime Burchi [@burchim](https://github.com/burchim)