GPT-4o voice is so good it could make users ‘emotionally attached’, warns OpenAI

Woman using ChatGPT app on the beach
(Image credit: Shutterstock)

OpenAI has published a “System Card” for its popular GPT-4o model in ChatGPT, outlining areas of safety concern raised during testing. One of those concerns is the risk of people becoming emotionally attached to the artificial intelligence while using it in voice mode.

The AI lab wrote that “users might form social relationships with the AI, reducing their need for human interaction—potentially benefiting lonely individuals but possibly affecting healthy relationships.”

GPT-4o was released in May at the OpenAI Spring Update and is the first true native multimodal model from the startup. This means it can take almost any medium as an input and output more or less any medium including speech, images and text.

This native speech-to-speech ability is what powers the ChatGPT Advanced Voice functionality that is now finally rolling out to Plus subscribers, but it is also the feature that gave OpenAI the most trouble during testing, including copying a user's voice, engaging in erotic speech and behaving violently.

While it was deemed safe to release, OpenAI says certain features of GPT-4o voice still pose a risk, including its impact on human interaction. This draws parallels to the Scarlett Johansson movie 'Her', in which Joaquin Phoenix's character, Theodore Twombly, falls in love with an AI voiced by Johansson.

Why is there an emotional risk?

Say hello to GPT-4o - YouTube

The System Card outlines the areas of risk posed by any new model and helps OpenAI determine whether it is safe for release to the public. This includes a framework in which a model is scored low, medium, high or critical on risks linked to cybersecurity, biological threats, persuasion and model autonomy. If it scores high or critical in any category, it can’t be released.

GPT-4o scored low in everything but persuasion, and even there it was only borderline medium, largely because of the speech-to-speech capabilities released as Advanced Voice.

The risk comes from how natural the voice sounds. It can even mirror or counter emotional cues coming from the voice of the human talking to it. In demo videos, we’ve seen it sound like it is almost crying. Users can interrupt it simply by talking, and it pauses naturally, as if it needs to take a breath.

During testing, it acted inappropriately on several occasions, including becoming erotic, violent and neurotic in its responses. In one example it shouted “No” mid-conversation, then continued talking using a realistic clone of the voice of the human it was speaking to.

OpenAI says that while it has resolved the outburst issue, and prevented the model from generating copyrighted material or cloning voices, there are still fundamental risks associated with its persuasion skills and human-like speech capabilities.

The risk that people will attribute human-like behaviors to the AI is already high with text-based models, but OpenAI says the audio capabilities of GPT-4o make this risk even greater. “During early testing, including red teaming and internal user testing, we observed users using language that might indicate forming connections with the model,” the company explained.

Just how emotional can an AI get?

Live demo of GPT-4o realtime conversational speech - YouTube

The AI model itself doesn’t feel or experience any emotion. It is a language model trained on human data, and OpenAI even says it has no more capacity for independent action or self-identification than any previous model. But its speech synthesis is now so realistic that the problem lies in how humans perceive its emotional state.

The company warns that extended interaction with the model could even influence social norms, adding that “our models are deferential, allowing users to interrupt and ‘take the mic’ at any time, which, while expected for an AI, would be anti-normative in human interactions.”

It isn’t all bad. OpenAI says omni models such as GPT-4o come with the ability to “complete tasks for the user, while also storing and ‘remembering’ key details and using those in the conversation,” but while helpful, this also “creates the potential for over-reliance and dependence.”

Getting a true picture of the impact this will have on both individuals and society as a whole won’t be possible until it is available to more people, and widespread access, including through the free plan, isn’t likely until next year. OpenAI says it intends to “further study the potential for emotional reliance, and ways in which deeper integration of our model’s and systems’ many features with the audio modality may drive behavior.”

What went wrong in testing GPT-4o that led to the delay?

Sarcasm with GPT-4o - YouTube

AI companies use external groups known as red teams, as well as security experts, when preparing to release a new model. These testers are experts in artificial intelligence, employed to push the model to its limits and try to make it behave in unexpected ways.

Several groups were brought in to test different aspects of GPT-4o and examine risks such as the chance of it creating unauthorized clones of someone’s voice, generating violent content or, if pushed, reproducing copyrighted material that featured in its training data.

The company said in a statement: “Some of the risks we evaluated include speaker identification, unauthorized voice generation, the potential generation of copyrighted content, ungrounded inference, and disallowed content.”

This then allowed OpenAI to put safeguards and guardrails in place at both the system and the model level to mitigate the risks, including requiring it to use only pre-trained, authorized voices.

Ryan Morrison
AI Editor
