ChatGPT, Gemini and Claude all failed to solve a simple test that humans are acing


As artificial intelligence continues to build on its reputation as the smartest thing in the room, it is oddly therapeutic to hear that one test has it stumped.

In fact, this new benchmark is causing problems for even the most advanced models.

ARC-AGI-2, or to use its full and far less glamorous name, the "Abstraction and Reasoning Corpus for Artificial General Intelligence", is a new benchmark developed to measure an AI model’s reasoning and general problem-solving ability.

It was created by the ARC Prize Foundation, a non-profit that exists to accelerate the development of Artificial General Intelligence (AGI), something OpenAI CEO Sam Altman has claimed could arrive as soon as this year.

DeepSeek’s R1 model scored just 1.3% on the new test, while similar models such as Google’s Gemini and Anthropic’s Claude 3.7 Sonnet scored around 1%. OpenAI’s GPT-4.5, one of the models behind ChatGPT, scored 0.8%.

So what are they being tested on that is so hard?

What's the test?

The test itself consists of puzzle-like problems in which the AI model has to identify visual patterns in grids of colored squares. Once the pattern is identified, the model then has to produce the correct answer.

It’s a bit like grade-school math: you cannot simply memorize your way to the answer. Instead, the tasks require a model to apply existing knowledge and understanding to completely new problems.

By doing this, the test doesn’t just treat intelligence as the ability to solve problems or post the highest score. Instead, it looks at how efficiently AI can adapt, learn, and solve new problems on the fly.

This kind of test is designed to force the AI to solve problems it has never seen before, acquiring skills that lie outside the data it was trained on. Unlike some previous benchmarks, the aim here is to provide something that is easy for humans to complete but hard for AI.
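To make that concrete, here is a minimal sketch in Python of what an ARC-style task looks like under the hood. ARC tasks are published as JSON-style records: a few "train" input/output pairs demonstrate a hidden rule, and the solver must produce the output grid for each "test" input. The tiny grids and the mirror-image rule below are invented for illustration, not taken from the real ARC-AGI-2 task set, which uses the same structure with far harder, novel rules.

```python
# A minimal, self-contained sketch of an ARC-style task.
# Grids are 2D lists of integers 0-9, where each integer maps to a color.
# This task and its "flip each row horizontally" rule are invented
# for illustration only.

task = {
    "train": [  # demonstration pairs that reveal the hidden rule
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [  # the solver must produce the output for this input
        {"input": [[4, 0, 5]]},
    ],
}

def solve(grid):
    """A toy 'solver' that guesses the rule is a horizontal mirror."""
    return [list(reversed(row)) for row in grid]

# Check the guessed rule against every demonstration pair...
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])

# ...then apply it to the unseen test input. Prints [[5, 0, 4]].
print(solve(task["test"][0]["input"]))
```

The catch, and the whole point of the benchmark, is that no single hard-coded rule like this one generalizes: every task hides a different rule, so memorizing past answers buys a model nothing.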

More than 400 people were asked to take the same test. This human “panel” scored an average of 60% — far exceeding even the best-performing AI models.


This is where the team behind the test believes we should be testing AI. While the likes of ChatGPT, Gemini, and Claude can all outperform humans in a variety of tasks, there are still plenty of areas where humans are better.

As the name suggests, this isn’t the first version of the test. In 2019, Google engineer François Chollet created ARC-AGI-1, a benchmark that took AI models five years to beat and charted how far their reasoning eventually advanced.

While it could well take the models a few more years to crack this newer test, the team behind it believes the benchmark is an important measure to aim for.

Once there are no tasks left that are easy for humans but hard for AI, the team believes we will have achieved artificial general intelligence: a version of AI that matches or exceeds human capabilities across the board.


AI Editor

Alex is the AI editor at Tom’s Guide. Dialed into all things artificial intelligence right now, he knows the best chatbots, the weirdest AI image generators, and the ins and outs of one of tech’s biggest topics.

Before joining the Tom’s Guide team, Alex worked for the brands TechRadar and BBC Science Focus.

In his time as a journalist, he has covered the latest in AI and robotics, broadband deals, the potential for alien life, the science of being slapped, and just about everything in between.

Alex aims to make the complicated uncomplicated, cutting out the complexities to focus on what is exciting.

When he’s not trying to wrap his head around the latest AI whitepaper, Alex pretends to be a capable runner, cook, and climber.
