GENERATING MOVING 3D SOUNDSCAPES WITH LATENT DIFFUSION MODELS

Christian Templin1, Yanda Zhu2, Hao Wang1

1Stevens Institute of Technology, 2Hunan Normal University

Abstract: Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in first-order Ambisonics (FOA). Recent FOA models extend text-to-audio generation but remain restricted to static sources. In this work, we introduce SonicMotion, the first end-to-end latent diffusion framework capable of generating FOA audio with explicit control over moving sound sources. SonicMotion is implemented in two variants: 1) a descriptive model conditioned on natural language prompts, and 2) a parametric model conditioned on both text and spatial trajectory parameters for higher precision. To support training and evaluation, we construct a new dataset of over one million simulated FOA audio-caption pairs covering both static and dynamic sources, annotated with azimuth, elevation, and motion attributes. Experiments show that SonicMotion achieves state-of-the-art semantic alignment and perceptual quality comparable to leading text-to-audio systems, while uniquely attaining low spatial localization error.
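For readers unfamiliar with the FOA format referenced above, the sketch below illustrates how a mono source with a time-varying azimuth and elevation maps onto the four Ambisonics channels. This is a generic first-order encoding in the ACN/SN3D convention, shown only to clarify the spatial representation; the function, trajectory, and conventions here are illustrative assumptions, not the paper's actual data-simulation pipeline.

```python
import numpy as np

def foa_encode(mono, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics (ACN/SN3D).

    mono:      (T,) mono waveform
    azimuth:   (T,) per-sample azimuth in radians (counter-clockwise, 0 = front)
    elevation: (T,) per-sample elevation in radians (0 = horizontal plane)
    returns:   (T, 4) FOA signal in ACN channel order [W, Y, Z, X]
    """
    w = mono                                          # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)    # left-right component
    z = mono * np.sin(elevation)                      # up-down component
    x = mono * np.cos(azimuth) * np.cos(elevation)    # front-back component
    return np.stack([w, y, z, x], axis=-1)

# Example: a source sweeping counter-clockwise from the front to the back
sr = 16000
t = np.arange(sr * 2) / sr                            # 2 seconds of samples
mono = 0.1 * np.random.randn(t.size)                  # placeholder source signal
azimuth = np.linspace(0.0, np.pi, t.size)             # front (0 rad) -> back (pi rad)
elevation = np.zeros_like(azimuth)                    # stay in the horizontal plane
foa = foa_encode(mono, azimuth, elevation)            # shape (T, 4)
```

A per-sample trajectory like `(azimuth, elevation)` above is the kind of spatial parameter the parametric variant conditions on, while the descriptive variant infers the trajectory from the text prompt alone.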

Descriptive Model Demos

Prompt: "A police siren wails from the front and slowly moves counter-clockwise to the back"

Download WAV

Prompt: "A lion roars from the back right and below"

Download WAV

Prompt: "A laughing man slowly moves counter-clockwise from the left to the front"

Download WAV

Prompt: "A bird chrips from the front and above, then flies below"

Download WAV