If you thought Sora was impressive now watch it with AI generated sound from ElevenLabs

(Image credit: OpenAI)

Artificial intelligence speech startup ElevenLabs offered an insight into what its planning to release in the future, adding sound effects to AI generated video for the first time.

Best known for its near human-like text-to-speech and synthetic voice services, ElevenLabs added artificially generated sound effects to videos produced using OpenAI’s Sora.

OpenAI unveiled its impressive Sora text-to-video artificial intelligence model last week, showcasing some of the most realistic, consistent and longest AI generated video to date.

ElevenLabs says it isn’t ready to release its text-to-sfx model yet but when live it will be able to create a full range of sounds including footsteps, waves and ambience. The company wrote on X: "We were blown away by the Sora announcement but felt it needed something... What if you could describe a sound and generate it with AI?"

ElevenLabs expanding to include sounds

ElevenLabs was founded in 2022 and is seen as producing the most realistic synthetic voices, generating speech that is close enough to natural to be almost undetectable.

The U.K.-based startup reached billion dollar value unicorn status at the start of this year with its most recent $80 million Series B round. This announcement of the funding round came with a new tool for synching AI speech in video for auto translations — taking on the international dubbing market.

There are already some text-to-sfx models on the market, often built around music AI models including myEdit, AudioGen and Stable Audio from StabilityAI. The sounds from ElevenLabs appear to be among the most natural but it isn’t clear how much editing was involved.

It isn’t currently clear when text-to-sfx will launch but ElevenLabs has released a waitlist sign-up that asks for a “prompt you might use to create a sound”.

What does this mean for AI video?

OpenAI Sora video of eye — (Image credit: OpenAI)

The next stage will likely be tools that can analyze the content of a video and automatically add sound effects at exactly the right points. The same could apply to music. Most AI music tools are currently text-to-music, but in future with multimodality, they could go from image or video.

One of the dreams of generative AI has been the ability to create an entire, fully rounded piece of content from a single prompt.

At the moment that is barely a dream, let alone close to reality but with advances like text-to-sfx, improved AI video and synthetic voice — it is getting closer.

More from Tom's Guide

Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?

ElevenLabs expanding to include sounds

Sign up to get the BEST of Tom's Guide direct to your inbox.

What does this mean for AI video?

More from Tom's Guide