Nobody knows how ChatGPT thinks — but OpenAI says it’s closer to cracking the mystery
Creators of chatbots can tell you how they've trained them. But then it starts to get complicated.
Creators of AI chatbots like ChatGPT can explain how they train them, and even how the underlying technology works, but they can't fully explain what their creations do with the information they've been trained on.
It's an important issue to solve, because AI developers are often taken by surprise by what their creations can and can't do. For example, the Udio team built an AI music model and found it could also write and perform standup comedy.
Even the leaders in the field struggle to work out what LLMs and other frontier models are doing with that information, but OpenAI appears to have taken a first step toward decoding the mystery.
While a lot remains unknown, OpenAI researchers have found 16 million features in GPT-4 which they say reveal what the model is ‘thinking’ about.
What has OpenAI discovered?
The researchers did so using a technique called sparse autoencoders: machine learning models that pick out only the most important features. This is in contrast to other types of autoencoders, which consider every feature and are therefore less useful for this kind of analysis.
Say you’re discussing cars with a friend. You still know how to make your favorite dish, but that concept is unlikely to come up in the car conversation.
OpenAI said sparse autoencoders identify the smaller set of features, or concepts, that matter most for generating an answer to a prompt, much like the smaller set of concepts a person relies on in any particular discussion.
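To make that idea concrete, here's a rough, illustrative sketch in Python (using PyTorch) of what a sparse autoencoder looks like. The sizes, the ReLU encoder and the L1 sparsity penalty here are generic textbook choices for this kind of model, not OpenAI's published setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps a model's internal activations onto a wide bank of candidate features,
    with training pressure that keeps most features switched off for any given input."""
    def __init__(self, activation_dim: int, num_features: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, num_features)
        self.decoder = nn.Linear(num_features, activation_dim)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly zeros once trained
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_weight=1e-3):
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = features.abs().mean()  # discourages features that fire on everything
    return reconstruction_error + l1_weight * sparsity_penalty

# Toy usage with made-up sizes: 512-dimensional activations, 4,096 candidate features.
sae = SparseAutoencoder(activation_dim=512, num_features=4096)
batch = torch.randn(8, 512)  # stand-in for real activations captured from a language model
features, reconstruction = sae(batch)
print(sae_loss(batch, features, reconstruction))
```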
However, while sparse autoencoders can find features in a given model, that’s only one step towards interpreting it. More work is needed to understand how a model fully uses those features.
OpenAI thinks this work is important because understanding how models work means it can find better ways to approach model safety.
One part of a bigger picture
"We're sharing progress toward understanding the neural activity of language models. We improved methods for training sparse autoencoders at scale, disentangling GPT-4’s internal representations into 16 million features—which often appear to correspond to understandable concepts." (OpenAI on X, June 6, 2024)
Another challenge is training the sparse autoencoders themselves, which is complicated for several reasons, including the extra computational power needed to enforce the sparsity constraints and the need to avoid overfitting.
However, OpenAI says it developed new state-of-the-art methods that allow it to scale sparse autoencoders to tens of millions of features on frontier AI models such as GPT-4 and GPT-4o.
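OpenAI hasn't spelled out its full recipe here, but one widely used trick for keeping sparse autoencoders manageable at scale is a top-k activation, which simply keeps the strongest handful of features for each input and zeroes out the rest. The snippet below is a hypothetical illustration of that idea rather than OpenAI's actual code.

```python
import torch

def top_k_features(pre_activations: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Keep only the k strongest features per example and zero out the rest,
    enforcing sparsity directly instead of relying on a penalty term."""
    values, indices = torch.topk(pre_activations, k, dim=-1)
    sparse = torch.zeros_like(pre_activations)
    sparse.scatter_(-1, indices, torch.relu(values))
    return sparse

pre_acts = torch.randn(8, 4096)            # hypothetical encoder outputs before sparsification
features = top_k_features(pre_acts, k=32)
print((features != 0).sum(dim=-1))         # at most k active features per example
```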
To check how interpretable these features are, OpenAI looked at fragments of documents where they activate. Examples included phrases related to price increases and rhetorical questions.
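In spirit, that check can be as simple as running a large pile of text through the model, recording how strongly a given feature fires on each snippet, and then reading the snippets where it fires hardest. The toy example below fakes the activation scores to show the shape of the process; real scores would come from the model and the trained sparse autoencoder.

```python
import random

snippets = [
    "Prices rose sharply for the third straight month.",
    "The company announced another price increase on Tuesday.",
    "Isn't that exactly what we were warned about?",
    "The recipe calls for two cups of flour.",
]

def feature_activation(text: str) -> float:
    # Placeholder: in practice this would be one feature's activation
    # from the sparse autoencoder when the model reads this snippet.
    return random.random()

ranked = sorted(snippets, key=feature_activation, reverse=True)
for snippet in ranked[:2]:
    print(snippet)  # the document fragments where this feature fires most strongly
```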
What happens next?
While it’s a first step toward showing what large language models are focusing on, OpenAI admits there are several limitations.
For starters, many of the features it discovered are still hard to interpret, with some activating with no clear pattern. Furthermore, the researchers don’t yet have good ways to check the validity of their interpretations.
In the short term, OpenAI hopes the features it found can help monitor and steer language model behavior.
In the long term, OpenAI wants interpretability to provide new ways to reason about model safety and robustness. Understanding how and why an AI model works the way it does will help people trust it when it’s making important decisions.
Christoph Schwaiger is a journalist who mainly covers technology, science, and current affairs. His stories have appeared in Tom's Guide, New Scientist, Live Science, and other established publications. Always up for joining a good discussion, Christoph enjoys speaking at events or to other journalists and has appeared on LBC and Times Radio among other outlets. He believes in giving back to the community and has served on different consultative councils. He was also a National President for Junior Chamber International (JCI), a global organization founded in the USA. You can follow him on Twitter @cschwaigermt.