OpenAI knocks Gemini off the top of chatbot leaderboard with its new model
Robot wars.
OpenAI's ChatGPT and Google Gemini have been duking it out for your chatbot prompts for months, but the competition is really starting to heat up.
While Claude briefly took the top spot on the AI benchmarking leaderboard LMSys Chatbot Arena earlier this year, Gemini had since been reigning supreme.
Now, though, a new version of ChatGPT-4o (20240808) has reclaimed the lead from its rivals with a score of 1314 — 17 points ahead of Gemini-1.5-Pro-Exp.
This comes just a day after Google mentioned its lead on the arena board during its Made by Google keynote.
As lmsys.org put it on X, "New ChatGPT-4o demonstrates notable improvement in technical domains, particularly in Coding (30+ points over GPT-4o-20240513), as well as in Instruction-following and Hard Prompts."
We called it
As the team posted on August 14, 2024: "Exciting Update from Chatbot Arena! The latest @OpenAI ChatGPT-4o (20240808) API has been tested under 'anonymous-chatbot' for the past week with over 11,000 community votes. OpenAI has now successfully re-claimed the #1 position, surpassing Google's Gemini-1.5-Pro-Exp with an…"
We spotted recently that OpenAI had rolled out a new version of GPT-4o in ChatGPT, and a different but similar model arrived for developers yesterday, too — the same day the Chatbot Arena results were revealed.
In our testing, we found it to be much snappier than prior versions, even building an entire iOS app in an hour using the latest version of the model.
That, paired with improvements to the Mac app, means it's been a bigger week than usual for ChatGPT users and OpenAI itself.
Still, with new models and revamped ones arriving all the time, there's every chance we'll see a reshuffle at the top of the pile in the coming months — or even weeks.
We have yet to see the launch of Google Ultra 1.5 or Claude Opus 1.5, and xAI's Grok 2 has already made its first appearance in the top ten.
A freelance writer from Essex, UK, Lloyd Coombes began writing for Tom's Guide in 2024 having worked on TechRadar, iMore, Live Science and more. A specialist in consumer tech, Lloyd is particularly knowledgeable on Apple products ever since he got his first iPod Mini. Aside from writing about the latest gadgets for Future, he's also a blogger and the Editor in Chief of GGRecon.com. On the rare occasion he’s not writing, you’ll find him spending time with his son, or working hard at the gym. You can find him on Twitter @lloydcoombes.
Araki
It doesn't really matter; I'll be using gpt-4o and gpt-4o-mini via API even if they were below all these Geminis and Claudes in benchmarks, simply because OpenAI's models are the least censored and don't have these biases the others for some reason absolutely must push on the user. What's the point of the unmoderated endpoints from Google and Anthropic if their models will break through the instruction and write three paragraphs, wasting paid tokens and computational resources, telling me how absolutely unsafe it is to roleplay as an anime catgirl?
At least until the next open-source model that will rule the scene for one day before getting overshadowed by a new proprietary model.
Iamhe02
Gemini is absurdly, comically censored. Try asking it, "Who was the first president of the United States?" Then ask yourself if you'd be willing to pay $20/month for a model that can't handle such a basic, uncontroversial query.
Araki
Iamhe02 said: "Gemini is absurdly, comically censored. Try asking it, 'Who was the first president of the United States?' Then, ask yourself if you'd be willing to pay $20/month for a model that can't handle such a basic, uncontroversial query."
It's so Google to write "Avoid creating or reinforcing unfair bias" in their Google AI Principles and then fine-tune their model on numerous unfair biases they personally believe in. Or I guess I should say "fair biases"? Such hypocrisy.
Most people are far removed from the technology behind all that "AI stuff" and actually believe the things LLMs say without double-checking, so using a product they developed to force their own beliefs onto the general public should straight-up be illegal.