Dynamic Spatial Audio Soundscapes with Latent Diffusion Models

Christian Templin¹, Yanda Zhu², Hao Wang¹

¹Stevens Institute of Technology, ²Hunan Normal University

Abstract: Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in first-order Ambisonics (FOA). Recent FOA models extend text-to-audio generation but remain restricted to static sources. In this work, we introduce SonicMotion, the first end-to-end latent diffusion framework capable of generating FOA audio with explicit control over moving sound sources. SonicMotion is implemented in two variations: 1) a descriptive model conditioned on natural language prompts, and 2) a parametric model conditioned on both text and spatial trajectory parameters for higher precision. To support training and evaluation, we construct a new dataset of over one million simulated FOA-caption pairs that include both static and dynamic sources with annotated azimuth, elevation, and motion attributes. Experiments show that SonicMotion achieves state-of-the-art semantic alignment and perceptual quality comparable to leading text-to-audio systems, while uniquely attaining low spatial localization error.
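The dataset described above is built from simulated FOA recordings of static and moving sources with known azimuth and elevation. As a rough illustration only (the paper's actual simulation pipeline is not specified here), the sketch below encodes a mono source into first-order Ambisonics along a per-sample trajectory, assuming ACN channel ordering (W, Y, Z, X) and SN3D normalization; the sample rate, clip length, and placeholder source signal are likewise assumptions.

```python
import numpy as np

def encode_foa_moving_source(mono, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics (ACN order W, Y, Z, X;
    SN3D normalization) along a per-sample azimuth/elevation trajectory.

    mono:      (n,) mono source signal
    azimuth:   (n,) azimuth in radians (0 = front, positive = counter-clockwise)
    elevation: (n,) elevation in radians (0 = horizon, positive = up)
    returns:   (4, n) FOA signal
    """
    w = mono                                        # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left-right component
    z = mono * np.sin(elevation)                    # up-down component
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front-back component
    return np.stack([w, y, z, x])

# Example: a source starting at the front and moving counter-clockwise to the
# back over a 4-second clip, staying on the horizontal plane.
sr = 16000
n = sr * 4
mono = 0.1 * np.random.randn(n)              # placeholder source signal
azimuth = np.linspace(0.0, np.pi, n)         # front (0 rad) -> back (pi rad)
elevation = np.zeros(n)
foa = encode_foa_moving_source(mono, azimuth, elevation)
print(foa.shape)  # (4, 64000)
```

In this sketch the trajectory is sampled densely (per audio sample), which keeps the moving source free of zipper artifacts; a coarser trajectory could be interpolated to audio rate before encoding.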

Descriptive Model Demos

Prompt: "A police siren wails from the front and slowly moves counter-clockwise to the back"

Download WAV

Prompt: "A lion roars from the back right and below"

Download WAV

Prompt: "A laughing man slowly moves counter-clockwise from the left to the front"

Download WAV

Prompt: "A bird chrips from the front and above, then flies below"

Download WAV