---
library_name: transformers
tags:
- accent
license: cc-by-sa-4.0
language:
- en
metrics:
- f1
base_model:
- facebook/mms-lid-256
pipeline_tag: audio-classification
---

# Model Card for vkao8264/mms-accent-predict

Classifies voice input into 11 English accents.

## Model Details

This model is a fine-tune of facebook/mms-lid-256 on the [Speech Accent Archive dataset](https://accent.gmu.edu/). It classifies voice input into 11 English accents:

| ID | Accent          |
|----|-----------------|
| 0  | African         |
| 1  | Australian      |
| 2  | British         |
| 3  | EastAsian       |
| 4  | EasternEuropean |
| 5  | LatinAmerican   |
| 6  | MiddleEastern   |
| 7  | NorthAmerican   |
| 8  | SouthAsian      |
| 9  | SouthEastAsian  |
| 10 | WesternEuropean |

## Uses

Because of the constraints of the training data, the input audio should contain the following passage (the elicitation text used by the Speech Accent Archive) for best prediction results:

> Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

### Direct Use

You can load the model with the Transformers library using the ID `vkao8264/mms-accent-predict`:

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

sample_rate = 16000    # MMS is built on Wav2Vec2, which expects 16 kHz audio
max_audio_length = 15  # seconds; longer clips are truncated

id_to_class = {
    0: "African",
    1: "Australian",
    2: "British",
    3: "EastAsian",
    4: "EasternEuropean",
    5: "LatinAmerican",
    6: "MiddleEastern",
    7: "NorthAmerican",
    8: "SouthAsian",
    9: "SouthEastAsian",
    10: "WesternEuropean",
}

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")

def load_and_preprocess_audio(path):
    waveform, sr = torchaudio.load(path)

    # Resample to 16 kHz if needed
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Remove the channel dimension so the feature extractor receives a 1-D signal
    waveform = waveform.squeeze(0)

    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True,
    )
    return inputs.input_values

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)

with torch.no_grad():
    predictions = model(inputs)

pred_label = torch.argmax(predictions.logits).item()
print(id_to_class[pred_label])
```

A sketch for reading out the full probability distribution over all 11 accents is included at the end of this card.

## Training Details

### Training Data

The dataset consists of roughly 2,100 unique audio samples from the Speech Accent Archive, downloaded from [Kaggle](https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data). It was split into a training set of 1,698 samples and a validation set of 425 samples.

## Evaluation

F1 score on the validation set: 0.86

![Evaluation results](mms_eval.png)
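As a rough guide to reproducing a score like the one above, here is a minimal sketch that runs the model over a validation manifest and computes an F1 score with scikit-learn. It reuses `model` and `load_and_preprocess_audio` from the Direct Use snippet; the manifest paths below are placeholders, and since this card does not state the averaging mode, macro averaging is assumed:

```python
from sklearn.metrics import f1_score
import torch

# Hypothetical validation manifest: (audio path, integer accent ID) pairs.
# Replace these placeholder paths with your own validation files.
val_manifest = [
    ("val/speaker_001.mp3", 7),
    ("val/speaker_002.mp3", 2),
    # ...
]

y_true, y_pred = [], []
with torch.no_grad():
    for path, label in val_manifest:
        # Reuses load_and_preprocess_audio and model from the Direct Use snippet
        inputs = load_and_preprocess_audio(path)
        logits = model(inputs).logits
        y_true.append(label)
        y_pred.append(torch.argmax(logits).item())

# The averaging mode behind the reported 0.86 is not specified;
# macro averaging is assumed here.
print(f1_score(y_true, y_pred, average="macro"))
```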
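Finally, the sketch promised in the Direct Use section: if you want the model's confidence across all 11 accents rather than just the top label, you can apply a softmax to the logits. This reuses `model`, `inputs`, and `id_to_class` from the Direct Use snippet:

```python
import torch

# Reuses model, inputs, and id_to_class from the Direct Use snippet.
with torch.no_grad():
    logits = model(inputs).logits  # shape: (1, 11)

probs = torch.softmax(logits, dim=-1).squeeze(0)

# Print accents sorted from most to least likely.
for idx in torch.argsort(probs, descending=True):
    print(f"{id_to_class[idx.item()]}: {probs[idx].item():.3f}")
```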