# USAD: Universal Speech and Audio Representation via Distillation
Universal Speech and Audio Distillation (USAD) is a unified speech, sound, and music encoder distilled from domain-specific teachers. Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.
USAD models are all transformer encoders operating at a 50 Hz frame rate. The teacher models are WavLM Base+ and ATST Frame.
| Model | Parameters | Dim | Layers | Checkpoint |
|---|---|---|---|---|
| USAD Small | 24M | 384 | 12 | link | 
| USAD Base | 94M | 768 | 12 | link | 
| USAD Large | 330M | 1024 | 24 | link | 
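For reference, the 50 Hz frame rate over 16 kHz input corresponds to a hop of 320 samples per output frame. A minimal arithmetic sketch (exact output lengths may differ slightly depending on the encoder's padding and stride details):

```python
# Rough frame-count estimate for a 50 Hz encoder over 16 kHz audio.
# Note: exact lengths depend on the model's internal padding/stride.
SAMPLE_RATE = 16_000   # Hz
FRAME_RATE = 50        # Hz
HOP = SAMPLE_RATE // FRAME_RATE  # 320 samples per output frame

def approx_num_frames(num_samples: int) -> int:
    """Approximate number of output frames for a waveform of num_samples."""
    return num_samples // HOP

print(approx_num_frames(10 * SAMPLE_RATE))  # ~500 frames for 10 s of audio
```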
## Installation
```bash
pip install -U transformers
```
## Load Model and Extract Features
```python
import torch
from transformers import AutoModel

# Load the pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Large", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model.
# You can also load waveforms directly with torchaudio.load.

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]:              model final output (batch_size, seq_len, encoder_dim)
# results["mel"]:            mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]:  list of (batch_size, seq_len, encoder_dim)
# results["ffn"]:            list of (batch_size, seq_len, encoder_dim)
```
See `usad_model.py` for more details about the model.
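For downstream use, the per-layer hidden states are often combined with a learnable weighted sum, as in SUPERB-style probing. This is not part of USAD's API, just a common pattern, sketched below:

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable weighted sum over per-layer hidden states (a common probing setup)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, seq_len, dim) tensors, one per layer
        stacked = torch.stack(hidden_states, dim=0)       # (layers, batch, seq, dim)
        w = torch.softmax(self.weights, dim=0)            # normalize layer weights
        return (w[:, None, None, None] * stacked).sum(0)  # (batch, seq, dim)

# Example: mean-pool the fused features for an utterance-level classifier
# pooled = WeightedLayerSum(len(results["hidden_states"]))(results["hidden_states"]).mean(dim=1)
```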
## Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```
## Acknowledgments

Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.