File size: 3,577 Bytes

b9db7a4

---
language: en
tags:
- audio
- speech
- emotion-recognition
- wav2vec2
datasets:
- TESS
- CREMA-D
- SAVEE
- RAVDESS
license: mit
metrics:
- accuracy
- f1
---

# wav2vec2-emotion-recognition

This model is fine-tuned on the Wav2Vec2 architecture for speech emotion recognition. It can classify speech into 8 different emotions with corresponding confidence scores.

## Model Description

- **Model Architecture:** Wav2Vec2 with sequence classification head
- **Language:** English
- **Task:** Speech Emotion Recognition
- **Fine-tuned from:** facebook/wav2vec2-base
- **Datasets:** Combined emotion datasets
 - [TESS](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)
 - [CREMA-D](https://www.kaggle.com/datasets/ejlok1/cremad)
 - [SAVEE](https://www.kaggle.com/datasets/barelydedicated/savee-database)
 - [RAVDESS](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)

## Performance Metrics
- **Accuracy:** 79.57%
- **F1 Score:** 79.43%

## Supported Emotions
- 😠 Angry
- 😌 Calm
- 🤢 Disgust
- 😨 Fearful
- 😊 Happy
- 😐 Neutral
- 😢 Sad
- 😲 Surprised

## Training Details

The model was trained with the following configuration:
- **Epochs:** 15
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Optimizer:** AdamW
- **Weight Decay:** 0.03
- **Gradient Accumulation Steps:** 2
- **Mixed Precision:** fp16

For detailed training process, check out the [Fine-tuning Notebook](https://colab.research.google.com/drive/1VNhIjY7gW29d0uKGNDGN0eOp-pxr_pFL?usp=drive_link)

## Limitations

### Audio Requirements:

- Sampling rate: 16kHz (will be automatically resampled)
- Maximum duration: 1 minute
- Clear speech with minimal background noise recommended

### Performance Considerations:

- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy


## Demo
https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition

## Contact
* **GitHub**: [DGautam11](https://github.com/DGautam11)
* **LinkedIn**: [Deepan Gautam](https://www.linkedin.com/in/deepan-gautam)  
* **Hugging Face**: [@Dpngtm](https://huggingface.co/Dpngtm)

For issues and questions, feel free to:
1. Open an issue on the [Model Repository](https://huggingface.co/Dpngtm/wav2vec2-emotion-recognition)
2. Comment on the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition)

## Usage

```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load and preprocess audio
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
   resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
   speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
   speech_array = torch.mean(speech_array, dim=0, keepdim=True)

speech_array = speech_array.squeeze().numpy()

# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
   outputs = model(**inputs)
   predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted emotion
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]