---
license: cc-by-nc-4.0
---
# DSpAST: Disentangled Spatial Audio Spectrogram Transformer

[arXiv](https://arxiv.org/abs/2509.13927) | [GitHub](https://github.com/wilkinghoff/DSpAST)

This repository provides the checkpoints for [DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models](https://arxiv.org/abs/2509.13927).
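
The checkpoints can also be fetched programmatically. The following is a minimal sketch, not an official loader: it assumes the `huggingface_hub` package is installed and that each checkpoint link on this page points to a single file in the `kwilk90/DSpAST` repository. The helper names `checkpoint_filename` and `download_checkpoint` are hypothetical and only used for illustration.

```python
from pathlib import Path

# Repository id and file names are taken from the checkpoint links on this page.
REPO_ID = "kwilk90/DSpAST"


def checkpoint_filename(stage: int) -> str:
    """Return the checkpoint file name for training stage 1, 2, or 3."""
    if stage not in (1, 2, 3):
        raise ValueError(f"stage must be 1, 2, or 3, got {stage!r}")
    return f"DSpAST-stage{stage}"


def download_checkpoint(stage: int = 3) -> Path:
    """Download one checkpoint into the local Hugging Face cache and return its path."""
    # Imported lazily so that checkpoint_filename works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(stage))
    return Path(local_path)


if __name__ == "__main__":
    # Requires network access; prints the cached location of the stage 3 checkpoint.
    print(download_checkpoint(stage=3))
```

This only retrieves the file. How the weights are loaded into the model depends on the code in the [GitHub](https://github.com/wilkinghoff/DSpAST) repository and is not covered by this sketch.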

***

## Performance

On our system, the provided checkpoints achieve the following performance:

| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
| :---: | :---: | :---: | :---: | :---: |
| [SpatialAST](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/SpatialAST/finetuned.pth) | 49.90 | 24.43 | 17.87 | 32.50 |
| [DSpAST (stage 1)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage1) | 53.05 | 98.56 | 95.57 | 97.58 |
| [DSpAST (stage 2)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage2) | 52.64 | 20.31 | **14.44** | 28.35 |
| [DSpAST (stage 3)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage3) | **54.53** | **20.28** | **14.44** | **28.03** |

Similar improvements are also observed when DSpAST is used as the binaural encoder for spatial audio reasoning with LLMs. See our [paper](https://arxiv.org/abs/2509.13927) for details.

***

## References

If you use the checkpoints in your work, please cite the following papers:

``` bibtex
@article{wilkinghoff2025dspast,
    author     = {Wilkinghoff, Kevin and
                  Tan, Zheng-Hua},
    title      = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
    journal    = {arXiv:2509.13927},
    year       = {2025}
}
```
and the original [BAT](https://zhishengzheng.com/bat/) paper, which is the foundation of this work:
``` bibtex
@inproceedings{zheng2024bat,
  author       = {Zheng, Zhisheng and
                  Peng, Puyuan and
                  Ma, Ziyang and
                  Chen, Xie and
                  Choi, Eunsol and
                  Harwath, David},
  title        = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
  booktitle    = {Proc. ICML},
  year         = {2024}
}
```