Cool or creepy? Microsoft's VASA-1 is a new AI model that turns photos into 'talking faces'
Impressive lip-syncing
A new AI research paper from Microsoft promises a future where you can upload a photo, a sample of your voice and create a live, animated talking head of your own face.
VASA-1 takes in a single portrait photo and an audio file and converts it into a hyper realistic talking face video complete with lip sync, realistic facial features and head movement.
The model is currently only a research preview and not available for anyone outside of the Microsoft Research team to try, but the demo videos look impressive.
Similar lip sync and head movement technology is already available from Runway and Nvidia but this seems to be of a much higher quality and realism, reducing mouth artifacts. This approach to audio-driven animation is also similar to a recent VLOGGER AI model from Google Research.
How does VASA-1 work?
Microsoft says this is a new framework for the creation of lifelike talking faces and specifically for the purpose of animating virtual characters. All of the people in the examples were synthetic, made using DALL-E but if it can animate a realistic AI image, it can animate a real photo.
In the demo we see people talking as if they were being filmed, with slightly jerky but otherwise natural-looking movement. The lip sync is very impressive, with natural movement and no artefacts around the top and bottom of the mouth seen in other tools.
One of the most impressive things about VASA-1 seems to be the fact it doesn't require a face-forward portrait style image to make it work.
Sign up now to get the best Black Friday deals!
Discover the hottest deals, best product picks and the latest tech news from our experts at Tom’s Guide.
There are examples with shots facing a range of directions. The model also seems to have a high degree of control, capable of taking eye gaze direction, head distance and even emotion as an input to steer the generation.
What is the point of VASA-1?
One of the most obvious use cases for this is in advanced lip synching for games. Being able to create AI-driven NPCs with natural lip movement could be a game-changer for immersion.
It could also be used to create virtual avatars for social media videos, as seen already from companies like HeyGen and Synthesia. One other area is in AI-based movie making. You could make a more realistic music video if you can have an AI singer that looks like they are singing.
That said, the team say this is just a research demonstration, with no plans for a public release or even making it available to developers to use in products.
How well does VASA-1 work?
One thing that surprised the researchers was the ability of VASA-1 to perfectly lip-sync to a song, reflecting the words from the singer without issue despite no music being used in the training dataset. It also handled different image styles including the Mona Lisa.
They've got it creating 512x512 pixel images at 45 frames per second and can do it in about 2 minutes using a desktop-grade Nvidia RTX 4090 GPU.
While they say this is only for research, it will be a shame if this doesn’t get out into the public domain, even if only for developers as I’d love to see it in Runway or Pika Labs. Given Microsoft has a huge stake in OpenAI this could even be part of a future Copilot Sora integration.
More from Tom's Guide
- MacBook Air M3 announced — pre-orders open now for ‘world’s best consumer laptop for AI’
- I tested Google Gemini vs OpenAI ChatGPT in a 9-round face-off — here’s the winner
- OpenAI is building next-generation AI GPT-5 — and CEO claims it could be superintelligent
Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?