DSpAST: Disentangled Spatial Audio Spectrogram Transformer

arXiv | GitHub

Checkpoints for the paper "DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models".


Performance

On our system, the provided checkpoints achieve the following performance (↑: higher is better, ↓: lower is better):

| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
| --- | --- | --- | --- | --- |
| SpatialAST | 49.90 | 24.43 | 17.87 | 32.50 |
| DSpAST (stage 1) | 53.05 | 98.56 | 95.57 | 97.58 |
| DSpAST (stage 2) | 52.64 | 20.31 | 14.44 | 28.35 |
| DSpAST (stage 3) | 54.53 | 20.28 | 14.44 | 28.03 |
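
The gains over the SpatialAST baseline can be read directly off the table. As a minimal sketch, the following Python snippet computes the absolute difference per metric between DSpAST (stage 3) and SpatialAST. The numbers are copied from the table above; the dictionary layout and the helper names `deltas` and `improved` are illustrative assumptions and not part of the released code.

```python
# Results copied from the table above (illustrative layout, not an official API).
results = {
    "SpatialAST": {"mAP": 49.90, "ER20": 24.43, "MAE": 17.87, "DER": 32.50},
    "DSpAST (stage 3)": {"mAP": 54.53, "ER20": 20.28, "MAE": 14.44, "DER": 28.03},
}

# mAP is marked with an up arrow in the table; all other metrics are lower-is-better.
HIGHER_IS_BETTER = {"mAP"}


def deltas(baseline, candidate):
    """Absolute difference (candidate - baseline) for every metric."""
    return {m: round(candidate[m] - baseline[m], 2) for m in baseline}


def improved(metric, delta):
    """True if the change goes in the desirable direction for this metric."""
    return delta > 0 if metric in HIGHER_IS_BETTER else delta < 0


diff = deltas(results["SpatialAST"], results["DSpAST (stage 3)"])
for metric, delta in diff.items():
    status = "better" if improved(metric, delta) else "not better"
    print(f"{metric}: {delta:+.2f} ({status})")
```

With these values, stage 3 improves on SpatialAST in all four metrics: mAP rises by 4.63, while ER20°, MAE and DER drop by 4.15, 3.43 and 4.47, respectively.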

Similar improvements can also be observed when DSpAST is used as the binaural encoder for spatial audio reasoning with LLMs. See our paper for further details.


References

If you use the checkpoints in your work, we kindly ask you to cite the following paper:

@article{wilkinghoff2025dspast,
    author     = {Wilkinghoff, Kevin and
                  Tan, Zheng-Hua},
    title      = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
    journal    = {arXiv:2509.13927},
    year       = {2025}
}

and the original BAT paper, which is the foundation of this work:

@inproceedings{zheng2024bat,
  author       = {Zheng, Zhisheng and
                  Peng, Puyuan and
                  Ma, Ziyang and
                  Chen, Xie and
                  Choi, Eunsol and
                  Harwath, David},
  title        = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
  booktitle    = {Proc. ICML},
  year         = {2024}
}