Google DeepMind creates AI model that can add sound to silent videos
Charlie Chaplin's days are numbered.
Fresh off animating memes over the last few days, AI has turned its attention to silent videos, specifically to bringing audio to AI-generated clips.
Google’s DeepMind research arm has built a powerful new AI model that can add audio to videos without sound, dubbing over the top with sound effects and music.
What is most impressive about the new research is how accurately the audio follows the visuals. One clip shows a close-up of a guitar being played, and the generated music closely matches the actual notes being played.
In some ways, it's the other side of the coin to last month's generation of music based on a visual prompt via ElevenLabs. It also brings plenty of potential for restoring old media that no longer has an audio component, and Charlie Chaplin may be about to get a new voice if this progresses further.
While the Google DeepMind model isn't available to use yet, there is a similar tool from ElevenLabs that you can try today. If you want to create a video to try it, you can check out our 5 best AI video generators list.
Google's new audio generation is off to a solid start
In the thread of posts on X, Google’s DeepMind account starts things off with a character walking through an eerily lit tunnel.
Light choir music plays over dramatic percussion, and the character's footsteps can be heard as they move through the scene.
The second clip, with audio generated from the prompt “Wolf howling at the moon”, ties in nicely with the animation, and even offers a chorus of howls in the distance.
We're sharing progress on our video-to-audio (V2A) generative technology. 🎥 It can add sound to silent clips that match the acoustics of the scene, accompany on-screen action, and more. Here are 4 examples - turn your sound on. 🧵🔊 https://t.co/VHpJ2cBr24 pic.twitter.com/S5m159Ye62 (June 17, 2024)
The harmonica example sounds a little too “uncanny valley” in the way its pitch shifts, but the backing underneath is solid, while the jellyfish one sounds like, well, jellyfish. Notably, that clip has some extra prompts, including “marine life” and “ocean”.
The video with the prompt “A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd” is a little off, though. For one, the beats don't quite match the rhythm in the video once it gets going. The sticks also appear to be focused on the snare and maybe a floor tom, while the audio sounds a tad more complex, with some other drums involved.
Still, it’s an impressive start to a project that’s only likely to grow with time.
Limitations of the DeepMind model
Like many projects from Google, this hasn't been released yet; it's just a research preview. Google says there are limitations and safety issues to address first.
For example: "Since the quality of the audio output is dependent on the quality of the video input, artefacts or distortions in the video, which are outside the model’s training distribution, can lead to a noticeable drop in audio quality."
They are also working on lip syncing for videos with speech. The model currently attempts this, but it isn't always accurate and creates an uncanny valley effect.
ElevenLabs is working on a similar project
We are excited to introduce the Text to Sound Effects API. To showcase it - we've built the first Video to Sounds Effects app. This app is available for free online and fully open-source. pic.twitter.com/8aalo8GCSo (June 17, 2024)
Not to be outdone, ElevenLabs this week revealed its new Text to Sound Effects API that can generate audio effects based on what you upload to it.
Unlike Google's V2A model, the API from ElevenLabs is already accessible and, in experiments, works surprisingly well.
In the example above, a video of a bottle smashing gets a few different options to choose from, while the DiCaprio laughing meme gets additional audio from other people in the room.
The company 'bootstrapped' a quick app to demonstrate what is possible with the API, allowing you to upload a video and have it add the sound. This is free to use and open source, and you can try it right now.
ElevenLabs told Tom's Guide the real aim is to have other companies and developers build things with the API themselves, such as integrating it into generative video.
A freelance writer from Essex, UK, Lloyd Coombes began writing for Tom's Guide in 2024 having worked on TechRadar, iMore, Live Science and more. A specialist in consumer tech, Lloyd is particularly knowledgeable on Apple products ever since he got his first iPod Mini. Aside from writing about the latest gadgets for Future, he's also a blogger and the Editor in Chief of GGRecon.com. On the rare occasion he’s not writing, you’ll find him spending time with his son, or working hard at the gym. You can find him on Twitter @lloydcoombes.