---
library_name: transformers
license: apache-2.0
base_model: ntu-spml/distilhubert
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: accent-id-distilhubert-finetuned-l2-arctic2
  results: []
---

# accent-id-distilhubert-finetuned-l2-arctic2

This model is a fine-tuned version of [ntu-spml/distilhubert](https://huggingface.co/ntu-spml/distilhubert) on 50% of the [L2-Arctic](https://psi.engr.tamu.edu/l2-arctic-corpus/) corpus.
It achieves the following results on the evaluation set:
- Loss: 0.0004
- Accuracy: 1.0

## Model description

The goal of this project is to create an accent classifier for people who learned English as a second language. A pretrained speech representation model was fine-tuned to classify the accents of 24 speakers of English whose first language is Hindi, Korean, Arabic, Vietnamese, Spanish, or Mandarin.

## How to use this model on an audio file

```
from huggingface_hub import notebook_login
notebook_login()  # optional: only needed if authentication is required

from transformers import pipeline
import torchaudio

pipe = pipeline("audio-classification", model="kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2")

# Load the audio and resample it to the 16 kHz the model expects
audio, sr = torchaudio.load('path_to_file/audio.wav')
audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio)
audio = audio.mean(dim=0)  # downmix to mono in case the file is stereo
audio = audio.numpy()

result = pipe(audio, top_k=6)
print(result)
print('First language of this speaker is predicted to be ' + result[0]['label']
      + ' with ' + str(round(result[0]['score'] * 100, 1)) + '% confidence')
```

## Intended uses & limitations

The model is very accurate on novel recordings from the original dataset, i.e. clips from the 24 speakers that were not used for training or testing. However, it is not accurate for voices from outside the dataset. Unfortunately, with only 24 speakers represented, the model appears to have memorized characteristics of these voices other than accent, so it does not generalize well to the real world.

## Training and evaluation data

The [L2-Arctic](https://psi.engr.tamu.edu/l2-arctic-corpus/) corpus is ~8 GB and is distributed via email. It includes approximately 24-30 hours of recordings in which 24 speakers read passages in English. The speakers' first languages are Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese, with two women and two men in each language group. For this model, 50% of the L2-Arctic data was used (half of the files from each speaker), which was then split 90/10 for train/test.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a hedged code sketch of this configuration appears at the end of this card):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 0.5216        | 1.0   | 196  | 0.4383          | 1.0      |
| 0.0106        | 2.0   | 392  | 0.0067          | 1.0      |
| 0.0038        | 3.0   | 588  | 0.0024          | 1.0      |
| 0.0021        | 4.0   | 784  | 0.0013          | 1.0      |
| 0.0014        | 5.0   | 980  | 0.0009          | 1.0      |
| 0.0011        | 6.0   | 1176 | 0.0007          | 1.0      |
| 0.0009        | 7.0   | 1372 | 0.0006          | 1.0      |
| 0.0008        | 8.0   | 1568 | 0.0005          | 1.0      |
| 0.0007        | 9.0   | 1764 | 0.0004          | 1.0      |
| 0.0007        | 10.0  | 1960 | 0.0004          | 1.0      |

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0
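
## Training setup sketch

The hyperparameters above come from the `Trainer` run that produced this model (the card is generated from the trainer). Below is a minimal sketch of how that setup could be reproduced, not the exact training script: it assumes the 50% L2-Arctic subset has been exported to per-accent folders such as `l2_arctic_subset/arabic/*.wav`. The `l2_arctic_subset` path, the folder layout, and the fixed 10-second padding/truncation window are all assumptions, not part of this repo.

```
import numpy as np
from datasets import Audio, load_dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

# Placeholder path: assumes clips were exported as l2_arctic_subset/<accent>/*.wav
dataset = load_dataset("audiofolder", data_dir="l2_arctic_subset", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

feature_extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")

def preprocess(batch):
    # Pad/truncate every clip to a fixed 10 s window so batches stack cleanly
    # (the window length is an assumption; the original preprocessing is not documented)
    arrays = [x["array"] for x in batch["audio"]]
    return feature_extractor(
        arrays,
        sampling_rate=16_000,
        max_length=16_000 * 10,
        padding="max_length",
        truncation=True,
    )

dataset = dataset.map(preprocess, batched=True, remove_columns=["audio"])
labels = dataset.features["label"].names  # accent names taken from the folder layout
splits = dataset.train_test_split(test_size=0.1, seed=42)  # the 90/10 split

model = AutoModelForAudioClassification.from_pretrained(
    "ntu-spml/distilhubert",
    num_labels=len(labels),
    label2id={name: i for i, name in enumerate(labels)},
    id2label={i: name for i, name in enumerate(labels)},
)

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

# Mirrors the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="accent-id-distilhubert-finetuned-l2-arctic2",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",        # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,                  # native AMP mixed precision
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```

Fixed-length padding is used here only because it lets the default data collator stack batches without a custom collate function; a dynamic-padding collator would be more memory-efficient for short L2-Arctic utterances.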