AST Fine-tuned for Non-Speech Sound Classification
This model is a fine-tuned version of MIT/ast-finetuned-audioset-10-10-0.4593 on the Nonspeech7k dataset.
Model Details
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
- Fine-tuned on: Nonspeech7k dataset
- Classes: breath, cough, crying, laugh, screaming, sneeze, yawn
- Sample Rate: 16 kHz
- Input Length: 10 seconds (160,000 samples)
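The sample rate and input length above imply a window of 16,000 × 10 = 160,000 samples. The card does not say how longer recordings should be handled; the sketch below assumes simple non-overlapping windows, and `window_bounds` is a hypothetical helper, not part of this model or of transformers. The last window may be shorter than the rest and would need padding before inference.

```python
SAMPLE_RATE = 16_000   # from the model card
WINDOW_SECONDS = 10    # from the model card


def window_bounds(num_samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return (start, end) sample indices for consecutive non-overlapping windows.

    Assumption: windows do not overlap. The final window may be shorter
    than the others.
    """
    window = sample_rate * window_seconds  # 160,000 samples at 16 kHz
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 25-second clip at 16 kHz has 400,000 samples.
for start, end in window_bounds(400_000):
    print(start, end)
```

Each `(start, end)` pair can be used to slice a mono waveform, for example `waveform[start:end]`, before passing the slice to the feature extractor.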
Usage
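from transformers import ASTFeatureExtractor, ASTForAudioClassification
import torch
import torchaudio
# Load model
feature_extractor = ASTFeatureExtractor.from_pretrained("FizzyBrain/ast-nonspeech7k-finetuned")
model = ASTForAudioClassification.from_pretrained("FizzyBrain/ast-nonspeech7k-finetuned")
# Load audio, downmix to mono, and resample to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("audio.wav")
waveform = waveform.mean(dim=0)  # (channels, samples) -> (samples,)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
# The feature extractor takes a 1-D NumPy array per clip
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = predictions.argmax(dim=-1).item()
# Class name, assuming the checkpoint's config carries the label mapping
predicted_label = model.config.id2label[predicted_class_id]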
Classes
- breath
- cough
- crying
- laugh
- screaming
- sneeze
- yawn
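To turn raw logits into ranked class names, a minimal sketch follows. It assumes the output indices follow the order in which the classes are listed above; the card does not confirm this, so check `model.config.id2label` on the loaded checkpoint first. The `top_k` helper and the example logits are hypothetical and only illustrate the mapping.

```python
import math

# Assumed index order: the order the classes appear in the model card.
# Verify against model.config.id2label before relying on it.
CLASSES = ["breath", "cough", "crying", "laugh", "screaming", "sneeze", "yawn"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3, classes=CLASSES):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for one clip; in practice use outputs.logits[0].tolist()
example_logits = [0.1, 2.3, -1.0, 0.4, -0.5, 1.7, 0.0]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```

Reporting several candidates instead of a single argmax can help when two classes score close together.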
Training Details
- Fine-tuned using advanced augmentation techniques
- Class-weighted loss for imbalanced data
- Layer-wise learning rate decay
- Early stopping with macro-F1 monitoring
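The card names these techniques without giving formulas or hyperparameters. The sketch below shows one common form of each in plain Python: inverse-frequency class weights, a geometric layer-wise learning rate schedule, macro-F1, and patience-based early stopping. All function names, the weighting formula, and the default patience are assumptions for illustration, not the settings used to train this model.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight for class c = N / (num_classes * count_c); rarer classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def layerwise_learning_rates(base_lr, num_layers, decay):
    """Top layer trains at base_lr; each layer below is scaled by `decay` once more."""
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class absent from both labels and predictions scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


class EarlyStopping:
    """Signal a stop once the score has not improved for `patience` epochs."""

    def __init__(self, patience=3):  # patience value is hypothetical
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience
```

In a PyTorch training loop, the class weights could be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)`, and the per-layer learning rates could be supplied as optimizer parameter groups. Monitoring macro-F1 instead of accuracy keeps rare classes from being hidden by frequent ones.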