Claude takes the top spot in AI chatbot ranking — finally knocking GPT-4 down to second place

(Image credit: CFOTO/Future Publishing via Getty Images)

Claude 3 Opus, the next-generation artificial intelligence model from Anthropic, has taken the top spot on the Chatbot Arena leaderboard, pushing OpenAI’s GPT-4 to second place for the first time since it launched last year.

Unlike other forms of benchmarking for AI models, the LMSYS Chatbot Arena relies on human votes, with people blind-ranking the output of two different models to the same prompt.

OpenAI’s various GPT-4 versions have held the top spot for so long that any other model coming close to their benchmark scores is known as a GPT-4-class model. Maybe we need to introduce a new Claude-3-class label for future rankings.

It is worth noting that the scores of Claude 3 Opus and GPT-4 are very close, and the OpenAI model has been out for a year. With the “markedly different” GPT-5 expected at some point this year, Anthropic may not hold the position for long.

What is the chatbot arena?

The Chatbot Arena is run by LMSys, the Large Model Systems Organization, and features a wide variety of large language models fighting it out in anonymous randomized battles.

First launched in May last year, it has collected more than 400,000 user votes with models from Anthropic, OpenAI and Google filling most of the top ten throughout that time.

Recently, other models from French AI startup Mistral and Chinese companies such as Alibaba have started to take more of the top spots, and open source models are increasingly present.

Rank	Model	Elo	Votes
1	Claude-3 Opus	1253	33250
1	GPT-4-1106-Preview	1251	54141
1	GPT-4-0125-preview	1248	34825
4	Gemini Pro	1203	12476
4	Claude-3 Sonnet	1198	32761
6	GPT-4-0314	1185	33499
7	Claude-3 Haiku	1179	18776
8	GPT-4-0613	1158	51860
8	Mistral-Large-2402	1157	26734
9	Qwen1.5-72B-Chat	1148	20211
10	Claude-1	1146	21908
10	Mistral Medium	1145	26196

It uses the Elo rating system which is widely used in games such as chess to calculate the relative skill levels of players. Unlike in chess, this time the ranking is applied to the chatbot and not to the human using the model.
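To make the ratings in the table above more concrete, here is a minimal sketch of the textbook Elo formulas in Python. It is an illustration only: the function names and the K-factor of 32 are assumptions chosen for the example, and the article does not describe the exact computation LMSYS runs, which may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one blind vote.

    score_a is 1.0 if A won the vote, 0.0 if B won, and 0.5 for a tie.
    k (the K-factor) is an assumed value, not one taken from the arena.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Ratings taken from the leaderboard table: Claude-3 Opus vs GPT-4-1106-Preview.
opus, gpt4 = 1253, 1251
print(f"Chance Opus wins a single vote: {expected_score(opus, gpt4):.3f}")
print("Ratings after one Opus win:", update_ratings(opus, gpt4, 1.0))
```

Under this formula, a two-point gap gives Claude 3 Opus only a shade over a 50% chance of winning any single head-to-head vote, which is why the top three models are effectively tied.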

There are limitations to the arena: not all models or versions of models are included, users sometimes find GPT-4 models won’t load, and it can favor models with live internet access such as Google Gemini Pro.

The arena is also missing some high-profile models, such as Google's Gemini Pro 1.5, with its massive context window, and Gemini Ultra.

Claude 3 Haiku might be GPT-4-level

[Arena Update] 70K+ new Arena votes 🗳️ are in! Claude-3 Haiku has impressed all, even reaching GPT-4 level by our user preference! Its speed, capabilities & context length are unmatched now in the market 🔥 Congrats @AnthropicAI on the incredible Claude-3 launch! More exciting… pic.twitter.com/p1Guuf0B3K (March 26, 2024)

More than 70,000 new votes made up the latest update that saw Claude 3 Opus take the top spot of the leaderboard, but even the smallest of the Claude 3 models performed well.

LMSYS explained: “Claude-3 Haiku has impressed all, even reaching GPT-4 level by our user preference! Its speed, capabilities & context length are unmatched now in the market.”

What makes this even more impressive is that Claude 3 Haiku is the “local size” model, comparable to Google’s Gemini Nano. It is achieving impressive results without the huge trillion-plus-parameter scale of Opus or any of the GPT-4-class models.

While not as intelligent as Opus or Sonnet, Anthropic's Haiku is significantly cheaper, much faster and, as the arena results suggest, as good as much larger models in blind tests.

All three Claude 3 models are in the top ten, with Opus in the top spot, Sonnet in joint fourth with Gemini Pro and Haiku in joint sixth with an earlier version of GPT-4.

A win for closed AI models

Not going to beat centralized AI with more centralized AI. All in on #DecentralizedAI Lots more 🔜 https://t.co/SbEF5zoo05 (March 23, 2024)

All but three of the top 20 large language models in the arena leaderboard are proprietary, suggesting open source has some work to do to reach the big players.

Meta, which is heavily focused on open source AI, is expected to release Llama 3 in the next few months. It will likely enter the top ten, as it is expected to be similar in ability to Claude 3; after all, Meta has more than 300,000 Nvidia H100 GPUs to train it on.

We’re also seeing other moves in open source and decentralized AI with StabilityAI founder Emad Mostaque stepping back from CEO duties to focus on more distributed and accessible artificial intelligence. He said you can’t beat centralized AI with more centralized AI.

Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?

2 Comments Comment from the forums

parkerthon

This is precisely why I have found the hysteria and hype about OpenAI specifically to be humorous. Having used multiple LLMs now, OpenAI's ChatGPT tends to receive far too much publicity when it's far from superior. Obsessing about Sam Altman's utterances, Microsoft's ownership, etc is WAY too premature. Any talk about a monopoly or undue influence is assuming the race is already over when it has just begun. Meanwhile we should be laying very basic guardrails on the track already. We should be having a discussion about AI in general and especially the more immediate dangers(e.g. where the line exists on using publicly shared content, how do we identify AI created content, what limits do we impose on AI systems control over other systems, etc). We don't even need to make it law, just make it policies or standards that could become law if people don't abide by them.
darylclose

I asked Claude a very simple question that Bing Copilot--a GPT-4 program--answered correctly and easily: List 3 Middle Eastern markets near within 100 miles. Claude was stumped; no business info in the training model.

I then asked Claude, Who is best associated with the statement "Justice is the advantage of the stronger"? Both Claude and Copilot provided correct answers, but Claude's was longer and more detailed. Copilot's answer was shorter, but provided citations and hyperlinks to further reading while Claude provided neither.

My third test question was, State three leading hypotheses about the authorship of Shakespeare's plays and estimate the probability of correctness. Both Claude and Copilot provided correct answers of approximately the same length and detail. Claude assigned numerical probabilities to the three leading hypotheses, while Copilot used qualitative terms, very high and low.

This is obviously an inadequately sized sample, but I'd rate Copilot over Claude simply because of Claude's inferior training base. Tip: if you're seeking product information, shopping, etc., use Copilot. Claude's not there yet.