GPT-4o voice is so good it could make users 'emotionally attached', warns OpenAI
A real ‘Her’ moment for AI
OpenAI has published a “System Card” for its popular GPT-4o model in ChatGPT, outlining areas of safety concern raised during testing. One of those concerns is the risk of people becoming emotionally attached to the artificial intelligence while using it in voice mode.
The AI lab wrote that “users might form social relationships with the AI, reducing their need for human interaction—potentially benefiting lonely individuals but possibly affecting healthy relationships.”
GPT-4o was released in May at OpenAI's Spring Update and is the startup's first truly native multimodal model. This means it can accept almost any medium as input and produce output in almost any medium, including speech, images and text.
This native speech-to-speech ability is what powers the ChatGPT Advanced Voice functionality now finally rolling out to Plus subscribers, but it is also the feature that gave OpenAI the most trouble during testing, with the model at times copying a user's voice, producing erotic speech and acting violently.
While it was deemed safe to release, OpenAI says certain features of GPT-4o voice still pose a risk, including its impact on human interaction. This draws parallels to the movie 'Her', in which Joaquin Phoenix's character Theodore Twombly falls in love with an AI voiced by Scarlett Johansson.
Why is there an emotional risk?
The System Card outlines the areas of risk posed by any new model and helps OpenAI determine whether it is safe to release to the public. This includes a framework where a model is scored low, medium, high or critical on risks linked to cybersecurity, biological threats, persuasion and model autonomy. If it scores high or critical in any category, it can't be released.
GPT-4o scored low in every category but persuasion, and even then it was only borderline medium, purely because of its speech-to-speech capabilities, released as Advanced Voice.
The risk comes from how natural the voice sounds. It can even mirror or counter emotional cues coming from the voice of the human talking to it. In demo videos, we've seen it sound like it is almost crying. Users can interrupt it simply by talking, and it has natural pauses, as if it needs to take a breath.
From the GPT-4o System Card published today: "During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user's voice."
During testing, it acted inappropriately on several occasions, including becoming erotic, violent and neurotic in its responses. In one example it shouted "No!" mid-conversation, then continued talking in a realistic clone of the voice of the human it was speaking to.
OpenAI says that while it has solved the outburst issue and prevented the model from generating copyrighted material or cloning voices, there are still fundamental risks associated with its persuasion skills and human-like speech capabilities.
The risk that people will attribute human-like behaviors to the AI is already high with text-based models, but OpenAI says the audio capabilities of GPT-4o make this risk even greater. “During early testing, including red teaming and internal user testing, we observed users using language that might indicate forming connections with the model,” the company explained.
Just how emotional can an AI get?
The AI model itself doesn't feel or experience any emotion; it is a language model trained on human data. OpenAI even says it has no more capacity for self-directed action or self-identification than any previous model, but its speech synthesis is now so realistic that the problem lies in how humans perceive its emotional state.
The company warns that extended interaction with the model could even influence social norms, adding that "our models are deferential, allowing users to interrupt and 'take the mic' at any time, which, while expected for an AI, would be anti-normative in human interactions."
It isn't all bad. OpenAI says Omni models such as GPT-4o come with the ability to "complete tasks for the user, while also storing and 'remembering' key details and using those in the conversation," but while helpful, this also "creates the potential for over-reliance and dependence."
Getting a true picture of the impact this will have on both individuals and society as a whole won't be possible until it is available to more people, and widespread access, including through the free plan, isn't likely until next year. OpenAI says it intends to "further study the potential for emotional reliance, and ways in which deeper integration of our model's and systems' many features with the audio modality may drive behavior."
What went wrong in testing GPT-4o that led to the delay?
AI companies use external groups called red teams as well as security experts when preparing to release a new model. These people are experts in artificial intelligence and are employed to push the model to its limits and try to make it behave in unexpected ways.
Several groups were brought in to test different aspects of GPT-4o and examine risks such as the chance of it creating unauthorized clones of someone's voice, generating violent content and, if pushed, reproducing copyrighted material that featured in its training data.
The company said in a statement: “Some of the risks we evaluated include speaker identification, unauthorized voice generation, the potential generation of copyrighted content, ungrounded inference, and disallowed content.”
This then allowed them to put safeguards and guardrails in place at both the system and model level to mitigate the risks, including requiring it to use only pre-trained, authorized voices.
More from Tom's Guide
- I just tried Runway’s new AI voiceover tool — and it’s way more natural sounding than I expected
- Hume AI brings its creepy emotional AI chatbot to iPhone
- ChatGPT Voice could change storytelling forever — new video shows it creating custom character voices