Moshi Chat's GPT-4o advanced voice competitor tried to argue with me — OpenAI doesn't need to worry just yet
It can run on your device
Moshi Chat is a new native speech AI model from French startup Kyutai, promising a similar experience to GPT-4o where it understands your tone of voice and can be interrupted.
Unlike GPT-4o, Moshi is a smaller model and can be installed locally and run offline. This could be perfect for the future of smart home appliances — if they can improve the responsiveness.
I had several conversations with Moshi. Each lasts up to five minutes in the current online demo, and every one ended with Moshi repeating the same word over and over, losing coherence.
In one of the conversations it started to argue with me, flat out refusing to tell me a story and insisting instead on stating a fact. It wouldn’t let up until I said “tell me a fact.”
This is all likely an issue of context window size and compute resources that can be solved over time. While OpenAI doesn’t need to worry about the competition from Moshi yet, it does show that others are catching up, much as Luma Labs, Runway and others are pressing against Sora’s quality.
What is Moshi Chat?
Moshi Chat is the brainchild of the Kyutai research lab and was built from scratch six months ago by a team of eight researchers. The goal is to make it open and build on the new model over time, and it is already the first openly accessible native generative voice AI.
“This new type of technology makes it possible for the first time to communicate in a smooth, natural and expressive way with an AI,” the company said in a statement.
Its core functionality is similar to OpenAI’s GPT-4o but from a much smaller model. It is also available to use today, whereas GPT-4o advanced voice won’t be widely available until the fall.
The team suggests Moshi could be used in roleplay scenarios or even as a coach to spur you on while you train. The plan is to work with the community and make it open so others can build on top of and further fine-tune the AI.
It is a 7B-parameter multimodal model called Helium, trained on text and audio codecs, but Moshi is natively speech in, speech out. It can run on an Nvidia GPU, Apple's Metal or a CPU.
What happens next with Moshi?
Kyutai hopes that the community support will be used to enhance Moshi's knowledge base and factuality. These have been limited because it is a lightweight base model, but it is hoped that expanding these aspects in combination with native speech will create a powerful assistant.
The next stage is to further refine the model and scale it up to allow for more complex and longer form conversations with Moshi.
In using it and from watching the demos I’ve found it incredibly fast and responsive for the first minute or so, but the longer the conversation goes on the more incoherent it becomes. Its lack of knowledge is also obvious, and if you call it out for making a mistake it gets flustered and goes into a loop of "I’m sorry, I’m sorry, I’m sorry."
This isn’t a direct competitor for OpenAI’s GPT-4o advanced voice yet, even though advanced voice isn’t currently available. But offering an open, locally running model that has the potential to work in much the same way is a significant step forward for open source AI development.
Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?