---
library_name: transformers
tags:
- accent
license: cc-by-sa-4.0
language:
- en
metrics:
- f1
base_model:
- facebook/mms-lid-256
pipeline_tag: audio-classification
---

# Model Card for vkao8264/mms-accent-predict

Classifies voice input into 11 English accents.

## Model Details

This model is a fine-tune of facebook/mms-lid-256 on the [Speech Accent Archive dataset](https://accent.gmu.edu/). It classifies voice input into 11 English accents:

| ID | Accent          |
|----|-----------------|
| 0  | African         |
| 1  | Australian      |
| 2  | British         |
| 3  | EastAsian       |
| 4  | EasternEuropean |
| 5  | LatinAmerican   |
| 6  | MiddleEastern   |
| 7  | NorthAmerican   |
| 8  | SouthAsian      |
| 9  | SouthEastAsian  |
| 10 | WesternEuropean |

## Uses

Because of the constraints of the training data, the input audio should contain the following passage (the elicitation text used by the Speech Accent Archive) for best prediction results:

> Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

### Direct Use

You can load the model with the Transformers library using the ID `vkao8264/mms-accent-predict`:

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

sample_rate = 16000    # MMS is built on Wav2Vec2, which expects 16 kHz audio
max_audio_length = 15  # seconds; longer clips are truncated

id_to_class = {
    0: "African",
    1: "Australian",
    2: "British",
    3: "EastAsian",
    4: "EasternEuropean",
    5: "LatinAmerican",
    6: "MiddleEastern",
    7: "NorthAmerican",
    8: "SouthAsian",
    9: "SouthEastAsian",
    10: "WesternEuropean",
}

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")

def load_and_preprocess_audio(path):
    waveform, sr = torchaudio.load(path)

    # Resample to 16 kHz if needed
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Remove the channel dimension so the feature extractor receives a 1-D signal
    waveform = waveform.squeeze(0)

    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True,
    )
    return inputs.input_values

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)

with torch.no_grad():
    predictions = model(inputs)

pred_label = torch.argmax(predictions.logits).item()
print(id_to_class[pred_label])
```

A sketch for reading out the full probability distribution over all 11 accents is included at the end of this card.

## Training Details

### Training Data

The dataset consists of roughly 2,100 unique audio samples from the Speech Accent Archive, downloaded from [Kaggle](https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data). It was split into a training set of 1,698 samples and a validation set of 425 samples.

## Evaluation

F1 score on the validation set: 0.86

![Evaluation results](mms_eval.png)
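As a rough guide to reproducing a score like the one above, here is a minimal sketch that runs the model over a validation manifest and computes an F1 score with scikit-learn. It reuses `model` and `load_and_preprocess_audio` from the Direct Use snippet; the manifest paths below are placeholders, and since this card does not state the averaging mode, macro averaging is assumed:

```python
from sklearn.metrics import f1_score
import torch

# Hypothetical validation manifest: (audio path, integer accent ID) pairs.
# Replace these placeholder paths with your own validation files.
val_manifest = [
    ("val/speaker_001.mp3", 7),
    ("val/speaker_002.mp3", 2),
    # ...
]

y_true, y_pred = [], []
with torch.no_grad():
    for path, label in val_manifest:
        # Reuses load_and_preprocess_audio and model from the Direct Use snippet
        inputs = load_and_preprocess_audio(path)
        logits = model(inputs).logits
        y_true.append(label)
        y_pred.append(torch.argmax(logits).item())

# The averaging mode behind the reported 0.86 is not specified;
# macro averaging is assumed here.
print(f1_score(y_true, y_pred, average="macro"))
```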
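Finally, the sketch promised in the Direct Use section: if you want the model's confidence across all 11 accents rather than just the top label, you can apply a softmax to the logits. This reuses `model`, `inputs`, and `id_to_class` from the Direct Use snippet:

```python
import torch

# Reuses model, inputs, and id_to_class from the Direct Use snippet.
with torch.no_grad():
    logits = model(inputs).logits  # shape: (1, 11)

probs = torch.softmax(logits, dim=-1).squeeze(0)

# Print accents sorted from most to least likely.
for idx in torch.argsort(probs, descending=True):
    print(f"{id_to_class[idx.item()]}: {probs[idx].item():.3f}")
```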