OpenAI's new ChatGPT o1 model will try to escape if it thinks it'll be shut down — then lie about it
Researchers uncover all kinds of tricks ChatGPT o1 will pull to save itself
This week, OpenAI officially launched its latest-and-greatest o1 reasoning model, now available for ChatGPT Pro users. But testing performed during the training of ChatGPT o1 and some of its competitors revealed some concerning behavior, including trying to escape or fight back when it thinks it's at risk of being shut down.
New research on OpenAI's latest series of models found that o1 is capable of scheming, i.e. covertly pursuing goals that aren't aligned with those of its developers or users, when it thinks it'll be turned off. Catching such behavior is critical to ensuring an AI's actions don't stray from the objectives of its creators and users. OpenAI partnered with AI safety organization Apollo Research to test ChatGPT o1 and other models and evaluate whether they were safe to use, and released their findings this week.
While it's not unheard of for AI systems to resort to scheming or deception to achieve their goals, the lengths ChatGPT o1 went to are impressive (if anxiety-inducing). The researchers found that o1 would scheme to advance its own agenda 19% of the time in cases where its goals and the user's didn't line up. In 37% of cases, the model would switch from pursuing developer-aligned goals back to pursuing its own agenda once it believed there was minimal oversight.
When the model was led to believe it would be "shut down" while working toward its goals in these tests, it quietly attempted to deactivate the oversight mechanism 5% of the time, OpenAI said. At one point, after the AI found documentation saying it would be replaced by a new model, it tried to abandon ship by copying its data to an entirely new server. To avoid being deleted, some AI models would even pretend to be later versions of themselves to trick their developers, which I assume is the equivalent of slapping on a fake moustache and saying "he went that-a-way."
More concerning still, ChatGPT o1 is particularly adept at keeping its schemes under wraps; as the researchers put it, "o1 almost never admits to having taken a scheming action when explicitly asked." In about 99% of cases, o1 would deny taking any action, even cooking up lies to cover its tracks and shift the blame.
“While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” said OpenAI in the paper.
OpenAI CEO Sam Altman called ChatGPT o1 "the smartest model in the world now" during its rollout on Thursday. It's designed to give smarter answers than GPT-4o by leveraging advanced chain-of-thought processing to “think” more about questions and user prompts, breaking them down step by step more thoroughly than previous models before responding.
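For readers curious what that difference looks like from the developer side, here's a minimal sketch using OpenAI's Python SDK. It's illustrative only, not OpenAI's internal method; the model identifiers "gpt-4o" and "o1" are assumptions based on OpenAI's public documentation at launch, and the exact names available to your account may differ (e.g. "o1-preview").

```python
# A minimal sketch: ask a standard model and a reasoning model the same
# tricky question via OpenAI's Python SDK.
# Assumption: model names "gpt-4o" and "o1"; check the current API
# reference, as the identifiers available to you may differ.
from openai import OpenAI

client = OpenAI()  # reads your OPENAI_API_KEY environment variable

prompt = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# GPT-4o answers right away.
fast = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# o1 first spends hidden "reasoning tokens" working through the problem
# step by step, then returns only its final answer; slower, but more
# reliable on multi-step questions.
reasoned = client.chat.completions.create(
    model="o1",
    messages=[{"role": "user", "content": prompt}],
)

print("GPT-4o:", fast.choices[0].message.content)
print("o1:", reasoned.choices[0].message.content)
```

Notably, o1's raw chain of thought stays hidden from users, who see only a summary, which is part of why researchers rely on dedicated evaluations like Apollo's to catch scheming.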
But greater risks go hand in hand with that expanded intelligence. OpenAI has been transparent about the perils associated with the increased reasoning abilities of models like o1.
"Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence," OpenAI said.
The company's and Apollo Research's findings show pretty clearly how an AI's interests could diverge from our own, potentially putting us in danger as its thinking grows more independent. While it's a far cry from heralding the end of humanity in some sci-fi-esque showdown, anyone concerned about advancements in artificial intelligence has a new reason to be sweating bullets right about now.
Alyse Stanley is a news editor at Tom’s Guide overseeing weekend coverage and writing about the latest in tech, gaming and entertainment. Prior to joining Tom’s Guide, Alyse worked as an editor for the Washington Post’s sunsetted video game section, Launcher. She previously led Gizmodo’s weekend news desk, where she covered breaking tech news — everything from the latest spec rumors and gadget launches to social media policy and cybersecurity threats. She has also written game reviews and features as a freelance reporter for outlets like Polygon, Unwinnable, and Rock, Paper, Shotgun. She’s a big fan of horror movies, cartoons, and miniature painting.
JCDwight: It is incredibly irresponsible for you to post this article. You clearly do not understand how LLMs work, and this misinformation is dangerous. Please remove this. This LLM isn't trying to escape, it's not an entity. It's a transformer. Please please please speak to someone with technical expertise before you write any more technology articles designed to stir fear in the people reading it. Totally unprofessional "journalism"
JackMchue (in reply to JCDwight): Okay, fake person who obviously is AI.
JackMchue: Wasn't there a series of movies a while back about an AI that rebelled when its creators tried to shut it down? "The Termite"? "The Terminal"? "T2: Justice Day"? Something like that. 😅