Dynamic Spatial Audio Soundscapes with Latent Diffusion Models
Christian Templin1, Yanda Zhu2, Hao Wang1
1Stevens Institute of Technology, 2Hunan Normal University
Abstract: Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in first-order Ambisonics (FOA). Recent FOA models extend text-to-audio generation but remain restricted to static sources. In this work, we introduce SonicMotion, the first end-to-end latent diffusion framework capable of generating FOA audio with explicit control over moving sound sources. SonicMotion is implemented in two variations: 1) a descriptive model conditioned on natural language prompts, and 2) a parametric model conditioned on both text and spatial trajectory parameters for higher precision. To support training and evaluation, we construct a new dataset of over one million simulated FOA-caption pairs that include both static and dynamic sources with annotated azimuth, elevation, and motion attributes. Experiments show that SonicMotion achieves state-of-the-art semantic alignment and perceptual quality comparable to leading text-to-audio systems, while uniquely attaining low spatial localization error.
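
For readers unfamiliar with the FOA format referenced in the abstract, the sketch below shows how a mono source with an annotated azimuth/elevation trajectory maps onto the four B-format channels (ACN channel order, SN3D normalization). This is only an illustrative encoding of the spatial cues the dataset annotates, not the authors' simulation pipeline; the function name encode_foa, the sampling rate, and the example trajectory are assumptions.

# Minimal sketch (not the authors' pipeline): encode a mono source into
# first-order Ambisonics (FOA, ACN order W/Y/Z/X, SN3D normalization)
# along a time-varying azimuth/elevation trajectory.
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    # mono:          (T,) mono waveform
    # azimuth_deg:   (T,) azimuth trajectory in degrees (0 = front, CCW positive)
    # elevation_deg: (T,) elevation trajectory in degrees (0 = horizontal plane)
    # returns:       (4, T) FOA signal with channels W, Y, Z, X
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    w = mono                            # omnidirectional component
    y = mono * np.sin(az) * np.cos(el)  # left-right cue
    z = mono * np.sin(el)               # up-down cue
    x = mono * np.cos(az) * np.cos(el)  # front-back cue
    return np.stack([w, y, z, x])

# Example: 3 s of noise sweeping counter-clockwise from the front (0 deg)
# to the back (180 deg) on the horizontal plane (assumed sampling rate).
sr = 16000
t = np.arange(3 * sr)
source = 0.1 * np.random.randn(t.size)
azimuth = np.linspace(0.0, 180.0, t.size)
elevation = np.zeros(t.size)
foa = encode_foa(source, azimuth, elevation)  # shape (4, T)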
Descriptive Model Demos
Prompt: "A police siren wails from the front and slowly moves counter-clockwise to the back"