DSpAST: Disentangled Spatial Audio Spectrogram Transformer
Checkpoints of DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models.
Performance
On our system, the provided checkpoints achieve the following performance:
| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
|---|---|---|---|---|
| SpatialAST | 49.90 | 24.43 | 17.87 | 32.50 |
| DSpAST (stage 1) | 53.05 | 98.56 | 95.57 | 97.58 |
| DSpAST (stage 2) | 52.64 | 20.31 | 14.44 | 28.35 |
| DSpAST (stage 3) | 54.53 | 20.28 | 14.44 | 28.03 |
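
To make the gap between the baseline and the final checkpoint easier to read, the following minimal sketch computes the relative change per metric from the table rows. The numbers are copied from the table; the names `RESULTS`, `relative_change`, and `compare` are illustrative and not part of the released code.

```python
# Values copied from the performance table: (mAP, ER20°, MAE, DER).
RESULTS = {
    "SpatialAST": (49.90, 24.43, 17.87, 32.50),
    "DSpAST (stage 3)": (54.53, 20.28, 14.44, 28.03),
}
METRICS = ("mAP", "ER20", "MAE", "DER")


def relative_change(baseline: float, candidate: float) -> float:
    """Signed change of candidate relative to baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


def compare(baseline_name: str, candidate_name: str) -> dict:
    """Per-metric relative change between two rows of the table."""
    baseline = RESULTS[baseline_name]
    candidate = RESULTS[candidate_name]
    return {
        metric: relative_change(b, c)
        for metric, b, c in zip(METRICS, baseline, candidate)
    }


changes = compare("SpatialAST", "DSpAST (stage 3)")
for metric, change in changes.items():
    print(f"{metric}: {change:+.2f}%")
```

A positive change is an improvement for mAP (higher is better), while a negative change is an improvement for the three error metrics (lower is better).
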
Similar performance improvements can also be observed when using DSpAST as a binaural encoder for spatial audio reasoning with LLMs. See our paper for details.
References
If you use the checkpoints for your work, we kindly ask you to cite the following papers:
@article{wilkinghoff2025dspast,
author = {Wilkinghoff, Kevin and
Tan, Zheng-Hua},
title = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
journal = {arXiv:2509.13927},
year = {2025}
}
and the original BAT paper, which is the foundation of this work:
@inproceedings{zheng2024bat,
author = {Zheng, Zhisheng and
Peng, Puyuan and
Ma, Ziyang and
Chen, Xie and
Choi, Eunsol and
Harwath, David},
title = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
booktitle = {Proc. ICML},
year = {2024}
}