---
library_name: transformers
license: apache-2.0
base_model: ntu-spml/distilhubert
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: accent-id-distilhubert-finetuned-l2-arctic2
  results: []
---

# accent-id-distilhubert-finetuned-l2-arctic2

This model is a fine-tuned version of [ntu-spml/distilhubert](https://huggingface.co/ntu-spml/distilhubert) on 50% of the [L2-Arctic](https://psi.engr.tamu.edu/l2-arctic-corpus/) corpus.
It achieves the following results on the evaluation set:
- Loss: 0.0004
- Accuracy: 1.0

## Model description

The goal of this project is to create an accent classifier for people who learned English as a second language. A pretrained speech representation model was fine-tuned to classify the accents of 24 speakers of English whose first language is Hindi, Korean, Arabic, Vietnamese, Spanish, or Mandarin.

## How to use this model on an audio file

```
from huggingface_hub import notebook_login
notebook_login()  # optional: only needed if authentication is required

from transformers import pipeline
import torchaudio

pipe = pipeline("audio-classification", model="kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2")

# Load the audio and resample it to the 16 kHz the model expects
audio, sr = torchaudio.load('path_to_file/audio.wav')
audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio)
audio = audio.mean(dim=0)  # downmix to mono in case the file is stereo
audio = audio.numpy()

result = pipe(audio, top_k=6)
print(result)
print('First language of this speaker is predicted to be ' + result[0]['label']
      + ' with ' + str(round(result[0]['score'] * 100, 1)) + '% confidence')
```

## Intended uses & limitations

The model is very accurate on novel recordings from the original dataset, i.e. clips from the 24 speakers that were not used for training or testing. However, it is not accurate for voices from outside the dataset. Unfortunately, with only 24 speakers represented, the model appears to have memorized characteristics of these voices other than accent, so it does not generalize well to the real world.

## Training and evaluation data

The [L2-Arctic](https://psi.engr.tamu.edu/l2-arctic-corpus/) corpus is ~8 GB and is distributed via email. It includes approximately 24-30 hours of recordings in which 24 speakers read passages in English. The speakers' first languages are Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese, with two women and two men in each language group. For this model, 50% of the L2-Arctic data was used (half of the files from each speaker), which was then split 90/10 for train/test.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a hedged code sketch of this configuration appears at the end of this card):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 0.5216        | 1.0   | 196  | 0.4383          | 1.0      |
| 0.0106        | 2.0   | 392  | 0.0067          | 1.0      |
| 0.0038        | 3.0   | 588  | 0.0024          | 1.0      |
| 0.0021        | 4.0   | 784  | 0.0013          | 1.0      |
| 0.0014        | 5.0   | 980  | 0.0009          | 1.0      |
| 0.0011        | 6.0   | 1176 | 0.0007          | 1.0      |
| 0.0009        | 7.0   | 1372 | 0.0006          | 1.0      |
| 0.0008        | 8.0   | 1568 | 0.0005          | 1.0      |
| 0.0007        | 9.0   | 1764 | 0.0004          | 1.0      |
| 0.0007        | 10.0  | 1960 | 0.0004          | 1.0      |

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0
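
## Training setup sketch

The hyperparameters above come from the `Trainer` run that produced this model (the card is generated from the trainer). Below is a minimal sketch of how that setup could be reproduced, not the exact training script: it assumes the 50% L2-Arctic subset has been exported to per-accent folders such as `l2_arctic_subset/arabic/*.wav`. The `l2_arctic_subset` path, the folder layout, and the fixed 10-second padding/truncation window are all assumptions, not part of this repo.

```
import numpy as np
from datasets import Audio, load_dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

# Placeholder path: assumes clips were exported as l2_arctic_subset/<accent>/*.wav
dataset = load_dataset("audiofolder", data_dir="l2_arctic_subset", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

feature_extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")

def preprocess(batch):
    # Pad/truncate every clip to a fixed 10 s window so batches stack cleanly
    # (the window length is an assumption; the original preprocessing is not documented)
    arrays = [x["array"] for x in batch["audio"]]
    return feature_extractor(
        arrays,
        sampling_rate=16_000,
        max_length=16_000 * 10,
        padding="max_length",
        truncation=True,
    )

dataset = dataset.map(preprocess, batched=True, remove_columns=["audio"])
labels = dataset.features["label"].names  # accent names taken from the folder layout
splits = dataset.train_test_split(test_size=0.1, seed=42)  # the 90/10 split

model = AutoModelForAudioClassification.from_pretrained(
    "ntu-spml/distilhubert",
    num_labels=len(labels),
    label2id={name: i for i, name in enumerate(labels)},
    id2label={i: name for i, name in enumerate(labels)},
)

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

# Mirrors the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="accent-id-distilhubert-finetuned-l2-arctic2",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",        # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,                  # native AMP mixed precision
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```

Fixed-length padding is used here only because it lets the default data collator stack batches without a custom collate function; a dynamic-padding collator would be more memory-efficient for short L2-Arctic utterances.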