
An AI That Edits Your Soundscape Just by Being Told What You Want to Hear

Take a ten-second clip of a forest in the rain. Rainfall, maybe the odd branch creak, the muffled pressure of wet air. Now suppose you want it to sound like a sunny afternoon instead: birdsong brighter, rain gone, a gentle rustling where the downpour was. Traditionally, getting from one to the other means opening a mixing console and working through it layer by layer, deciding what to strip out and what to layer in. It’s painstaking. It assumes you already know the language of audio production. And for anyone working in virtual reality, gaming or immersive media, it’s a bottleneck that never quite goes away.

Engineers at the University of Pennsylvania think they have a better approach. Their system, called SmartDJ, lets users skip the manual steps entirely and just describe the result they want in plain language: “make this sound like a sunny forest,” or “put me in a quiet library.” The system figures out what needs to change and carries out the edits automatically.

Two Models, One Pipeline

The core problem SmartDJ is trying to solve is something researchers call declarative editing, where you declare the desired outcome rather than specifying the operations needed to achieve it. That sounds almost trivially simple, but it turns out to be genuinely hard for AI systems, because understanding a request and generating audio are tasks that current models approach quite differently. Language models, the technology underlying chatbots, are good at parsing what users mean. Diffusion models, which generate audio (and images) by gradually shaping random noise into something coherent, are good at producing sounds. Neither is particularly good at what the other does. SmartDJ’s solution is to run both in sequence, with the language model acting as a kind of producer, breaking the user’s prompt into a list of specific atomic operations, and then handing that list to a diffusion model to carry out. “The language model gives the system direction,” says Yiduo Hao, a doctoral student at Penn and one of the paper’s co-authors. “The diffusion model performs those directions.”

In practice, those atomic operations are fairly granular: add a sound event, remove one, isolate one from a mix, turn a specific element up or down by a set number of decibels, shift a sound’s apparent spatial position from left to right. Unremarkable individually, perhaps. But strung together by a system that has actually listened to the original audio and understood the instruction, they can achieve quite a lot.
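The atomic operations above can be pictured as a small structured vocabulary that the language model emits and the diffusion model consumes. Here is a minimal sketch of what such a plan might look like; the class and field names are hypothetical, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encoding of the atomic operations described above;
# SmartDJ's real internal representation may differ.
@dataclass
class AtomicOp:
    action: str                       # "add", "remove", "extract", "volume", "move"
    event: str                        # e.g. "rain", "birdsong"
    gain_db: Optional[float] = None   # for volume changes and additions
    position: Optional[str] = None    # e.g. "left", "right", for spatial ops

def plan_sunny_forest() -> list:
    """A plan a language model might emit for 'make this sound like
    a sunny forest' applied to a rainy-forest clip."""
    return [
        AtomicOp("remove", "rain"),
        AtomicOp("add", "birdsong", gain_db=3.0),
        AtomicOp("add", "leaves rustling", gain_db=-2.0, position="left"),
    ]

plan = plan_sunny_forest()
print([op.action for op in plan])   # each step is handed to the diffusion model in turn
```

The point of keeping the vocabulary this small is that each step is simple enough for a diffusion model to execute reliably, while the sequencing logic stays with the language model.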

There’s something else here too, which matters for anyone interested in immersive audio specifically. Most previous AI audio-editing tools worked only with mono sound, single-channel recordings with no spatial information. SmartDJ was designed from the ground up for stereo, meaning it can preserve or reshape the sense of where sounds appear to come from. This turns out to be harder than it sounds (so to speak), because spatial cues depend on subtle phase and amplitude differences between left and right channels, information that mel-spectrogram-based approaches, which earlier systems relied on, tend to discard. SmartDJ instead encodes audio as waveform latents, keeping that spatial information intact through the editing process.
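A toy example makes the stereo point concrete. Below, a tone is panned toward the right channel, and the level difference between channels is computed; this difference is one of the cues listeners use to localize a sound. This is an illustration of the cue itself, not of SmartDJ's encoding; a magnitude-only mel spectrogram computed per channel would keep the per-channel energies but discard the phase relationship between them.

```python
import numpy as np

# A 1 kHz tone panned toward the right channel (toy illustration).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
left, right = 0.3 * tone, 0.9 * tone          # louder on the right

def rms_db(x: np.ndarray) -> float:
    """Root-mean-square level in decibels."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

# Interaural level difference: the spatial cue an editor must preserve.
ild = rms_db(right) - rms_db(left)
print(f"ILD: {ild:.1f} dB toward the right")  # ~9.5 dB
```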

Building the Dataset from Scratch

Getting the system to work required training data that simply didn’t exist. “This problem needed a very unusual kind of data set,” says Zitong Lan, the paper’s first author. “It had to capture the goal, the steps and the result all at once.” So the team built it themselves, using GPT-4o to act as a sound designer, generating high-level editing instructions and the intermediate atomic steps to carry them out, while audio signal processing software served as the composer, actually rendering each step. The result was roughly 50,000 complex editing pairs and half a million single-step examples.
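Schematically, each training example pairs an instruction with its atomic steps and the audio before and after. The sketch below mimics that pipeline with toy numpy stems and simple gain mixing in place of GPT-4o and real signal-processing software; the structure, not the rendering, is the point, and all names here are illustrative.

```python
import numpy as np

# Toy mono "stems" standing in for real recorded sound events.
rng = np.random.default_rng(0)
stems = {"rain": rng.standard_normal(16000) * 0.5,
         "birdsong": rng.standard_normal(16000) * 0.2}

def render(active: dict) -> np.ndarray:
    """The 'composer': mix whichever stems are present, scaled by gain."""
    out = np.zeros(16000)
    for name, gain in active.items():
        out += gain * stems[name]
    return out

# One training pair: (instruction, atomic steps, before, after).
instruction = "make this sound like a sunny forest"
steps = [("remove", "rain"), ("add", "birdsong")]

before = render({"rain": 1.0})
state = {"rain": 1.0}
for action, event in steps:          # apply each atomic step in order
    if action == "remove":
        state.pop(event, None)
    elif action == "add":
        state[event] = 1.0
after = render(state)

print(instruction, "->", steps, "| rendered:", after.shape)
```

Repeating this loop with a language model proposing the instructions and steps is, in outline, how a dataset of 50,000 complex pairs and half a million single-step examples can be assembled without manual labeling.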

Outperforming the Field

The results of testing SmartDJ against earlier methods are fairly striking. In objective evaluations across multiple standard audio quality metrics, including Fréchet Distance, Fréchet Audio Distance and log-spectral distance, SmartDJ outperformed all baselines. In human perceptual studies involving 19 participants, more than 80% of listeners preferred SmartDJ’s output for audio quality, and more than 87% said it aligned better with the given instruction and preserved more of the original audio’s content. The spatial metrics tell a similar story: for tasks involving relocating a sound from one direction to another, competing methods left the sound essentially where it was, while SmartDJ’s edits matched the target closely. One telling experiment involved adding a sound and then removing it again, five times over. A good editor should end up back where it started. SmartDJ came closest.

There’s a degree of transparency built into the system that’s worth noting. When SmartDJ interprets a prompt, it doesn’t just go off and return an edited file; it produces a readable list of the steps it plans to take, something like “remove the sound of drilling; turn up the sound of typewriter typing by 2 dB; add the sound of phone ringing at right by 3 dB.” Users can inspect that plan, remove steps they don’t want, or add ones the system missed. “With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen,” says Mingmin Zhao, an assistant professor of computer science at Penn and the paper’s senior author. “We show that AI can help people edit audio in intuitive ways using simple language.”
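The inspect-and-revise loop can be sketched in a few lines: the plan arrives as an ordered list of human-readable steps, and the user can drop or add steps before anything is executed. The step wording follows the article's example; SmartDJ's real interface may differ.

```python
from typing import Optional

# The plan as SmartDJ might present it (wording from the article's example).
plan = [
    "remove the sound of drilling",
    "turn up the sound of typewriter typing by 2 dB",
    "add the sound of phone ringing at right by 3 dB",
]

def revise(plan: list, drop: Optional[str] = None, add: Optional[str] = None) -> list:
    """Return a copy of the plan with one step dropped and/or one appended."""
    revised = [step for step in plan if step != drop]
    if add:
        revised.append(add)
    return revised

# The user decides the phone ringing isn't wanted after all:
final = revise(plan, drop="add the sound of phone ringing at right by 3 dB")
print(final)   # only the two remaining steps go to the diffusion model
```

Because the edit happens on the plan rather than the audio, a rejected step costs nothing; the diffusion model only ever runs the steps that survive review.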

There are limits, of course. Supporting a new type of atomic editing operation currently requires retraining the diffusion model, which is not a small ask. The team also notes that the two components, language model and diffusion model, are trained separately rather than end-to-end, which leaves something on the table.

Zhao frames SmartDJ as part of a broader shift in how AI is changing creative tools. Text and image editing already respond to high-level instructions in ways that were hard to imagine even five years ago. Audio has lagged behind, partly because spatial realism adds a dimension of complexity that flat media don’t have to contend with; SmartDJ, he says, is an attempt to close that gap for sound.

Whether it scales to the full complexity of professional audio production remains to be seen. But for anyone who has ever wanted to simply tell a computer what a scene should sound like and let it work out the details, that possibility is a bit closer than it was.

https://arxiv.org/abs/2509.21625


Frequently Asked Questions

Why can’t existing audio AI tools just respond to plain-language instructions like “make this sound like a library”?

Most current tools were designed around template commands, requiring users to specify individual operations like “add bird sounds” or “remove rain.” They also lack the ability to actually listen to and analyse the source audio, so they can’t reason about what needs to change to achieve a described scene. SmartDJ adds an audio language model that hears the original clip and interprets the high-level goal before passing a detailed plan to a separate sound-generating system.

Does SmartDJ just regenerate the audio from scratch?

No, and that’s actually one of the harder design constraints. The system is required to preserve everything in the original recording that the instruction doesn’t call for changing. In a test where researchers added and then removed the same sound five times over, SmartDJ’s output drifted less from the original than any competing method, suggesting it is genuinely editing rather than generating a fresh scene.
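The round-trip test has a simple logic: add a sound, remove it, repeat, and measure how far the result has drifted from the original. The sketch below simulates that with an imperfect "remove" step and a relative-error metric; it illustrates the idea of the measurement, not the study's actual systems or numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
original = rng.standard_normal(16000)          # stand-in for the source clip
added = rng.standard_normal(16000) * 0.3       # the sound being added/removed

def drift_after_cycles(n: int, imperfection: float) -> float:
    """Relative distance from the original after n add/remove cycles,
    where each 'remove' leaves behind a little residual noise."""
    audio = original.copy()
    for _ in range(n):
        audio = audio + added                  # "add" edit
        audio = (audio - added
                 + imperfection * rng.standard_normal(16000))  # imperfect "remove"
    return float(np.linalg.norm(audio - original) / np.linalg.norm(original))

print(f"relative drift after 5 cycles: {drift_after_cycles(5, 0.01):.3f}")
```

An editor that truly inverts its own operations keeps this number near zero; one that quietly regenerates the scene each time accumulates error with every cycle.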

Why does spatial audio make this problem so much harder?

Stereo positioning depends on tiny differences in timing and volume between the left and right audio channels. Earlier AI editing systems typically converted audio into a format called a mel-spectrogram, which compresses those cues away. SmartDJ encodes audio differently, preserving the spatial information, which is why it can successfully move a sound from front to right without the cues collapsing. This matters a lot for VR and AR, where a sound’s apparent location is as important as what it sounds like.

Can users see what changes SmartDJ is planning before it edits?

Yes. Before executing anything, the system generates a readable list of its intended steps, such as removing a specific sound, adjusting another’s volume by a given number of decibels, and adding a new element at a particular spatial position. Users can revise or override individual steps, which means SmartDJ can function as a collaborative tool rather than a black box.



