---
license: cc-by-nc-4.0
---

# DSpAST: Disentangled Spatial Audio Spectrogram Transformer

[arXiv](https://arxiv.org/abs/2509.13927) | [GitHub](https://github.com/wilkinghoff/DSpAST)

Checkpoints of [DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models](https://arxiv.org/abs/2509.13927).

***

## Performance

On our system, the provided checkpoints achieve the following performance:

| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
| :---: | :---: | :---: | :---: | :---: |
| [SpatialAST](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/SpatialAST/finetuned.pth) | 49.90 | 24.43 | 17.87 | 32.50 |
| [DSpAST (stage 1)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage1) | 53.05 | 98.56 | 95.57 | 97.58 |
| [DSpAST (stage 2)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage2) | 52.64 | 20.31 | **14.44** | 28.35 |
| [DSpAST (stage 3)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage3) | **54.53** | **20.28** | **14.44** | **28.03** |

Similar performance improvements can also be observed when DSpAST is used as a binaural encoder for spatial audio reasoning with LLMs. Please see our [paper](https://arxiv.org/abs/2509.13927) for further information.

***

## References

If you use the checkpoints in your work, we kindly ask you to cite the following papers:

``` latex
@article{wilkinghoff2025dspast,
  author  = {Wilkinghoff, Kevin and Tan, Zheng-Hua},
  title   = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
  journal = {arXiv:2509.13927},
  year    = {2025}
}
```

and the original [BAT](https://zhishengzheng.com/bat/) paper, which is the foundation of this work:

``` latex
@inproceedings{zheng2024bat,
  author    = {Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  title     = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
  booktitle = {Proc. ICML},
  year      = {2024}
}
```
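
***

## Downloading the checkpoints

The DSpAST checkpoints linked in the performance table can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the repository ID and file names are taken from the links above, while the helper names `checkpoint_filename` and `download_checkpoint` are hypothetical and not part of the DSpAST code base.

```python
from pathlib import Path

# Repository ID as it appears in the checkpoint links of the performance table.
REPO_ID = "kwilk90/DSpAST"


def checkpoint_filename(stage: int) -> str:
    """Return the checkpoint file name for training stage 1, 2, or 3."""
    if stage not in (1, 2, 3):
        raise ValueError(f"stage must be 1, 2, or 3, got {stage!r}")
    return f"DSpAST-stage{stage}"


def download_checkpoint(stage: int = 3) -> Path:
    """Download one checkpoint into the local Hugging Face cache.

    Hypothetical convenience wrapper; returns the local path of the file.
    """
    # Imported lazily so checkpoint_filename works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(stage),
    )
    return Path(local_path)


if __name__ == "__main__":
    # Requires network access; prints where the stage 3 checkpoint was cached.
    print(download_checkpoint(stage=3))
```

How the downloaded file is loaded into the model is determined by the code in the [GitHub](https://github.com/wilkinghoff/DSpAST) repository and is not covered by this sketch.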