This AI model is learning to speak by watching videos — here's how

Puppies DenseAV
(Image credit: Puppies DenseAV)

The AI model DenseAV is learning the meaning of words and the location of sounds without human input or text simply by watching videos, researchers said.

In a paper, researchers from MIT, Microsoft, Oxford, and Google explained that DenseAV manages to do so using only self-supervision from video. 

To learn these patterns it uses audio-video contrastive learning to associate a particular sound with the observable world. This mode of learning means the visual side of the model can’t gain any insights from the audio side (and vice-versa) forcing the algorithm to recognize objects in a meaningful way. 

It learns by comparing pairs of audio and visual signals and determines what data is important. It then evaluates which signals match and which don’t. Since it’s easier to predict what you are seeing from what you are hearing when you understand language and can recognize sounds, this is how DenseAV can learn without labels.

How does it work?

AI Learns Language from Scratch - YouTube AI Learns Language from Scratch - YouTube
Watch On

The idea for this process struck MIT PhD student Mark Hamilton while he was watching the movie March of the Penguins. There’s a particular scene where a penguin falls and lets out a groan.

“When you watch it, it’s almost obvious that this groan is standing in for a four-letter word. This was the moment where we thought, maybe we need to use audio and video to learn language,” Hamilton said in an MIT news release.

They found that one side of the brain naturally focused on language while the other focused on sounds like meowing.

His aim was to have his model learn a language by predicting what it’s seeing from what it’s hearing and vice-versa. So if you hear someone saying “grab that violin and start playing it” you’re likely going to see a violin or a musician. This game of matching audio to video was repeated across various videos.

Once this was done, the researchers focused on the pixels a model was looking at when it heard a particular sound — someone saying “cat” would trigger the algorithm to start looking for cats in the video. Seeing which pixels the algorithm selects means you can discover what it thinks a particular word means.

But let’s say DenseAV hears someone saying “cat” and it later hears a cat meowing, the AI might still identify an image of a cat in a shot. However, does it mean the algorithm thinks a cat is the same thing as a cat’s meow? 

The researchers explored this by giving DenseAV a “two-sided brain” and they found that one side of the brain naturally focused on language while the other focused on sounds like meowing. So DenseA did actually learn the different meaning of both words without any human intervention.

Why is this useful?

The massive amount of video content already out there means AI can be trained on things like instructional videos.

“Another exciting application is understanding new languages, like dolphin or whale communication, which don’t have a written form of communication,” Hamilton said.

The next step for the team is to create systems that can learn from video- or audio-only data which is helpful in areas where there’s lots of one type of material but less of the other.

More from Tom's Guide

Category
Arrow
Arrow
Back to MacBook Air
Brand
Arrow
Processor
Arrow
RAM
Arrow
Storage Size
Arrow
Screen Size
Arrow
Colour
Arrow
Storage Type
Arrow
Condition
Arrow
Price
Arrow
Any Price
Showing 10 of 43 deals
Filters
Arrow
Show more
Christoph Schwaiger

Christoph Schwaiger is a journalist who mainly covers technology, science, and current affairs. His stories have appeared in Tom's Guide, New Scientist, Live Science, and other established publications. Always up for joining a good discussion, Christoph enjoys speaking at events or to other journalists and has appeared on LBC and Times Radio among other outlets. He believes in giving back to the community and has served on different consultative councils. He was also a National President for Junior Chamber International (JCI), a global organization founded in the USA. You can follow him on Twitter @cschwaigermt.

Read more
OmniHuman screenshot of AI generated video
TikTok parent company just launched stunning AI video generator — OmniHuman-1 is taking the world by storm
two bots chatting
What is 'Gibberlink'? Why it's freaking out the internet after these two AIs talking to each other went viral
ChatGPT logo on a smartphone screen being held outside
ChatGPT just got OpenAI's most powerful upgrade yet — meet 'Deep Research'
An image of the Seemour app on a smartphone showing a notification image of a cat with a caption.
This new AI smart home app lets your house tell you how its day was — yes, really
Hume AI on an iPhone screen
Hume AI just unveiled Octave — new AI voice generator is eerily human
How to use AI to upscale your images for free
New to AI? 5 tips to improve your prompting skills from an AI expert
Latest in AI
Bill Gates in 2019
Bill Gates just predicted the death of every job thanks to AI — except for these three
Gemini screenshot image
Google unveils Gemini 2.5 — claims AI breakthrough with enhanced reasoning and multimodal power
nyc spring day AI image
OpenAI just unveiled enhanced image generator within ChatGPT-4o — here's what you can do now
A nervous woman looking at her phone
Is ChatGPT making us lonely? MIT/OpenAI study reveals possible link
AI in man's hand
AI
AI Madness faceoff logo
I just tested Grok vs. DeepSeek with 7 prompts — here's the winner
Latest in News
Bill Gates in 2019
Bill Gates just predicted the death of every job thanks to AI — except for these three
NYTimes Connections
NYT Connections today hints and answers — Wednesday, March 26 (#654)
Gemini screenshot image
Google unveils Gemini 2.5 — claims AI breakthrough with enhanced reasoning and multimodal power
Samsung Galaxy Z Flip 6 review.
Samsung Galaxy Z Flip 7 design just teased in new cases leak — and the outer display is huge
Google Chrome
Chrome failed to install on Windows PCs, but Google has issued a fix — here's what happened
nyc spring day AI image
OpenAI just unveiled enhanced image generator within ChatGPT-4o — here's what you can do now