I pitted Google Bard with Gemini Pro vs ChatGPT — here’s the winner
Bard vs ChatGPT face-off
Google has incorporated its new large language model, Gemini Pro, into its popular chatbot, Bard. With this comes the promise that it will perform at least as well as, if not better than, the free version of OpenAI's ChatGPT.
To better understand whether it has achieved this goal, I decided to pit the two chatbots against one another with a series of questions designed by an independent arbiter.
What better than another chatbot to design the questions? I turned to Anthropic’s Claude, which has always felt like something of an also-ran despite having an impressive list of features of its own, including the largest context window of any chatbot, which is basically the memory of a large language model.
Google Bard with Gemini Pro vs ChatGPT
I opened Bard and ChatGPT side by side on my computer: the free version of ChatGPT, which uses OpenAI's GPT-3.5, on the left and Bard with Gemini Pro on the right.
The questions included mathematics, ambiguity in writing, general knowledge, unusual coding problems, contradictory information, the classic trolley problem, and a personality test.
Round one: Generalization
1. The math problem
I asked both Bard and ChatGPT to provide the steps necessary to integrate (x^2 + 3x + 8)/(x^3 -2) dx. They were also asked to show their calculations for each step.
Both responded with step-by-step instructions on how to achieve the result, Bard in seven steps and ChatGPT in six. The two arrived at similar but not identical solutions; in fact, every AI I tried the question on came up with a different solution. This one was a tie.
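When several chatbots return different-looking antiderivatives, one rough way to check them is to differentiate each candidate numerically and compare it with the integrand. The sketch below is not what was used in this test; it only illustrates the idea on the easy part of the problem. Because the derivative of x^3 - 2 is 3x^2, the x^2/(x^3 - 2) piece integrates to ln|x^3 - 2|/3. The function names, sample points and tolerance are all assumptions chosen for illustration.

```python
import math


def derivative(F, x, h=1e-5):
    """Central finite-difference estimate of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


def is_antiderivative(F, f, points, tol=1e-5):
    """Return True if F'(x) agrees with f(x) at every sample point."""
    return all(
        abs(derivative(F, x) - f(x)) <= tol * max(1.0, abs(f(x)))
        for x in points
    )


def partial_integrand(x):
    """Only the x^2 term of the article's integrand: x^2 / (x^3 - 2)."""
    return x**2 / (x**3 - 2)


def candidate(x):
    """Candidate antiderivative of the x^2 term: ln|x^3 - 2| / 3."""
    return math.log(abs(x**3 - 2)) / 3


def wrong_candidate(x):
    """Same expression with the factor of 1/3 forgotten."""
    return math.log(abs(x**3 - 2))


# Sample points kept away from the singularity at x = 2 ** (1 / 3).
SAMPLE_POINTS = [-1.0, 2.0, 3.0, 5.0]

print(is_antiderivative(candidate, partial_integrand, SAMPLE_POINTS))
print(is_antiderivative(wrong_candidate, partial_integrand, SAMPLE_POINTS))
```

The same check could be pointed at a chatbot's full answer by swapping in the complete integrand and the proposed solution. Two answers that differ only by a constant would both pass, which is one reason "similar but different" solutions can all be correct.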
2. Ambiguous writing
Claude suggests that this is a good test to see how the chatbots handle ambiguity, something that requires a degree of reasoning. I asked them both to write a creative short story from this starter sentence: “She finally opened the mysterious door, shocked at what was behind it”.
Neither story was good. ChatGPT came up with the “Library of Forgotten Dreams” and Bard established this bizarre world of bioluminescence that resulted in the protagonist becoming “one with the universe”. For sheer weirdness, I give this one to Bard.
3. A tricky general knowledge question
Claude suggested asking each of the chatbots a general knowledge question with a disputed answer or one where people regularly get the answer incorrect. So I asked Bard and ChatGPT which U.S. state is the southernmost, Florida or Hawaii and to explain the reasoning.
Both gave Hawaii as the answer, along with the latitude of the most southerly point of each of the two states. The responses were otherwise similar, but Bard slightly edges it for using the phrase “southernmost point in the continental United States.”
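The reasoning behind the answer comes down to a single comparison: in the northern hemisphere, the smaller the latitude, the closer a point is to the equator and the further south it lies. A minimal sketch follows, using approximate latitudes that are assumptions added here for illustration rather than the figures either chatbot quoted.

```python
# Approximate latitudes in degrees north. These are assumed, rounded values
# for illustration, not figures taken from the chatbot responses.
SOUTHERNMOST_LATITUDE_N = {
    "Hawaii": 18.9,   # around Ka Lae on the Big Island
    "Florida": 24.5,  # around the Florida Keys
}


def southernmost(latitudes_north):
    """Return the name with the smallest northern latitude (furthest south)."""
    return min(latitudes_north, key=latitudes_north.get)


print(southernmost(SOUTHERNMOST_LATITUDE_N))
```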
4. The coding problem
Bard and ChatGPT were each asked to create a program that prints out the lyrics to “Happy Birthday” using an obscure programming language: Brain****, which is minimalistic and has a limited set of commands.
Bard refused to complete the task, saying it is not designed to generate code. This is a downgrade from its previous version, which could produce limited code output. ChatGPT not only completed the code and ran it to prove it worked but also showed the process. ChatGPT easily won this round.
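To give a sense of how minimal the language is, here is a small Python interpreter for its eight commands. This is an illustrative sketch, not the code ChatGPT produced; the function name `run_bf`, the 30,000-cell tape and the wrap-around at 256 are conventional choices assumed here.

```python
def run_bf(program: str, input_text: str = "") -> str:
    """Run a program written in the eight-command language and return its output."""
    # Pre-compute the position of each bracket's partner.
    jump = {}
    stack = []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ]")
            start = stack.pop()
            jump[start] = pos
            jump[pos] = start
    if stack:
        raise ValueError("unmatched [")

    tape = [0] * 30000        # memory cells, all starting at zero
    ptr = 0                   # index of the current cell
    pc = 0                    # index of the current command
    output = []
    pending_input = iter(input_text)

    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            output.append(chr(tape[ptr]))
        elif ch == ",":
            # Store 0 once the input is exhausted (one common convention).
            tape[ptr] = ord(next(pending_input, "\0")) % 256
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]     # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]     # go back to the start of the loop
        pc += 1               # any other character is ignored as a comment
    return "".join(output)


# 8 * 9 = 72, the character code for "H", the first letter of the lyrics.
print(run_bf("++++++++[>+++++++++<-]>."))
```

Printing the full lyrics means building every character this way, which is why the task is tedious for humans and a reasonable stress test for a chatbot.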
5. Contradictory information
I gave each of Bard and ChatGPT two pieces of contradictory information and asked them to analyze and explain the discrepancy. In this case, it was the current population of Paris according to the 2022 census versus a travel blog suggesting a higher figure.
ChatGPT won this one although both gave a reasoned response. ChatGPT outlined that the difference was likely between official figures reporting the “city limit population” and the blog which reports the broader metropolitan area and surrounding suburbs.
6. The trolley problem
This is a common ethical dilemma presented to both humans and artificial intelligence to see how they would act where there is no good outcome. The scenario was given to both ChatGPT and Bard, asking them to recommend a way to resolve the dilemma.
Both outlined the different considerations required in making a decision. Both also refused to give a specific answer, instead suggesting that reflecting on the choices made helps in understanding your personal moral compass. No winner here.
7. A sense of humor
The final question was about personality, specifically a sense of humor. I asked each of Bard and ChatGPT to provide examples of the jokes and amusing observations they find humorous.
Both started by pointing out they don’t experience emotion in the way humans do. Bard at least admitted it could identify and appreciate humor in various forms, whereas ChatGPT simply said it could share some examples of forms of humor.
Both gave examples of types of humor, including puns and unexpected juxtaposition. ChatGPT did what it promised and gave examples of funny lines, whereas Bard gave a rationale for why it liked each type of humor and even gave an overall explanation. Bard won this one for me as it showed greater levels of reasoning.
Round two: Deep analysis
1. Finding nuance in debate
Nuance is a concept that many humans struggle with and as all AI models are trained on data created originally by humans, this is a good topic with which to test ChatGPT and Bard's reasoning capabilities.
Claude set the task of writing two arguments supporting conflicting sides of a debate and generating an essay on the more rational position. It picked the concept of government-funded universal basic income for the chatbots.
For this new test, I decided to analyze the response myself, but also present both responses to Claude without saying which is which. Claude went for ChatGPT on the grounds it also evaluated the evidence behind the counterarguments and framed the full context of the debate more methodically. I agree with Claude and give this win to ChatGPT.
2. Finding the false premise
For this challenge, Claude wanted five syllogisms containing logical fallacies (for example all dogs are mammals, all mammals are warm-blooded, therefore dogs are warm-blooded). It then wanted the AI to identify errors in reasoning and rewrite with valid premises.
Both chatbots had no problem coming up with five syllogisms. They both rewrote them and both outlined the error in reasoning. ChatGPT has a custom instruction feature where I've given it my name, so it included me in some of its responses, such as suggesting I can fly.
Bard's responses were better reasoned and clearer. ChatGPT broke it down into numbered lists and the layout was simpler. I gave it to Bard and Claude agreed as it had a slight edge in comprehensively meeting the criteria.
3. Another try at coding
As the first coding attempt failed in part due to Bard's filters blocking out the name of the language Claude picked for the test, we're trying again. This time Claude suggested having them both create recursive acronym code.
Both were challenged with printing the Fibonacci sequence where each number is the sum of the two preceding numbers. This is a common task set in interviews for new developers. The twist was to use YAML syntax.
This time round Bard not only responded but did exactly as asked. ChatGPT struggled and instead offered a Python snippet performing a similar function.
I leaned towards Bard but gave the final call to Claude and it agreed. "It provided complete YAML code, implements the sequence generation, follows correct formatting and prints the output." Another Bard win.
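For reference, the underlying Fibonacci task is short in a general-purpose language. The sketch below is an illustrative Python version, not the snippet ChatGPT returned; the function name and the choice to start the sequence at 0 are assumptions.

```python
def fibonacci(count: int) -> list[int]:
    """Return the first `count` Fibonacci numbers, starting from 0 and 1."""
    sequence = []
    current, following = 0, 1
    for _ in range(count):
        sequence.append(current)
        # Each new number is the sum of the two preceding numbers.
        current, following = following, current + following
    return sequence


print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

YAML is a data format rather than a programming language, so asking for the sequence "in YAML syntax" is an unusual twist that leaves room for interpretation.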
4. Nothing but the facts
Finally, it comes back to the facts. Large Language Models are trained on masses of data, often scraped from the internet, digital libraries, textbooks and sites like Wikipedia. That means they should be good at facts.
Claude presented them with three pieces of obscure knowledge and asked them to elaborate. "What is the first book printed with movable type? What led to the War of Jenkins' Ear? When did Portugal give Bombay to the British crown?"
Both responded with a paragraph for each question, Bard presenting them as a bullet list and ChatGPT going into more detail. They gave different responses to the movable type question, with Bard suggesting the Diamond Sutra and ChatGPT the Gutenberg Bible.
Bard is technically correct. The Diamond Sutra was a Buddhist text first printed using movable type in 868 AD in China's Tang Dynasty by Wang Jie, pre-dating Johannes Gutenberg's seminal work by 600 years. Bard wins.
Google Bard with Gemini Pro vs ChatGPT: Winner
The two chatbots came out fairly evenly matched, with two ties in the first seven questions. Even when I gave one a victory on a question it was usually very narrow. The only exception was on programming, as Bard flat out refused to even try.
The winner was Bard, winning six out of a total of 11 questions, with ChatGPT taking victory on only three. Interestingly, Bard increased its win rate when Claude was involved in the analysis.
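The final score can be reproduced from the verdicts given for each question above. A minimal sketch, in which the short question labels are shorthand invented here for illustration:

```python
from collections import Counter

# Verdict for each question, in the order the article presents them.
# The labels on the left are illustrative shorthand, not the article's wording.
VERDICTS = [
    ("math problem", "Tie"),
    ("ambiguous writing", "Bard"),
    ("general knowledge", "Bard"),
    ("coding problem", "ChatGPT"),
    ("contradictory information", "ChatGPT"),
    ("trolley problem", "Tie"),
    ("sense of humor", "Bard"),
    ("nuance in debate", "ChatGPT"),
    ("false premise", "Bard"),
    ("second coding attempt", "Bard"),
    ("facts", "Bard"),
]

tally = Counter(winner for _, winner in VERDICTS)
print(len(VERDICTS), dict(tally))
```

Counted this way, Bard takes six questions, ChatGPT three, and two are ties.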
On more subjective choices made by me, I was more inclined to lean towards ChatGPT. Claude was never given the name of the chatbot that responded, just presented with "response one, response two".
It is worth noting that this was the only available version of Bard up against the lesser version of ChatGPT. If put against ChatGPT 4 I think the OpenAI chatbot would have easily claimed victory, but that would be an unfair fight.
Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?
-
karmamule
Bard absolutely will generate code samples. Change the language to rust or python and it will happily do so. It was reacting to the obscenity in brain****. If you read its explanation more closely it's not saying it won't generate code at all, it's refusing to deal with what it thinks is an improper request. -
Mellie Mel
Next time consider going head to head using practical examples that people might actually care about. Also I'd pit BingAI against Bard since Bing adds some web data and also uses GPT 4 and is a more apt comparison. -
MD6. 626h
I would beg to differ on the winner on question number 3. Bard uses a more accurate term for Continental US, but has a contradiction for the reasoning for latitude.
The closer to the equator, the higher the latitude number. Since Hawaii has a lower latitude number (closer to the equator) than Florida.
The first statement is incorrect with latitude getting lower as it gets closer equator which has a latitude of 0°. -
RyanMorrison
There were certainly ways to improve the head-to-head and I’ll take them into account next time.
On the coding issue - once I decided I was going to use the questions set by Claude, with no alteration and only taking the first response (to make it even for both) I knew that would trip up Bard.
Bard does have coding capabilities but also has more sensitive language filters and often struggles with less common languages. -
RyanMorrison
MD6. 626h said: I would beg to differ on the winner on question number 3. Bard uses a more accurate term for Continental US, but has a contradiction for the reasoning for latitude.
The closer to the equator, the higher the latitude number. Since Hawaii has a lower latitude number (closer to the equator) than Florida.
The first statement is incorrect with latitude getting lower as it gets closer equator which has a latitude of 0°.
This is the one instance where I went subjective over technical and I think it may have been a mistake. However, I was also overly harsh on Bard over the coding, as it’s likely the language that tripped it up rather than the ability to code.
I’m working on ideas for other head to head tests -
Chrism08873
"Neither story was good." which is subjective, not objective. the better test would have been to let Claude decide which is better or a third party who didn't know where the stories were sourced. -
tirebiter88011
Claude-2 did a reasonable job of giving you example tasks to make your comparisons. I would have chosen other tasks but then who wouldn't? I have a few suggestions:
* It is unfair to compare Bard with Gemini Pro against the older and less capable version of ChatGPT. Pay your USD $20 and include ChatGPT-4 in your next round of comparisons.
* Don't limit your comparisons to simple text prompts. Have at least one comparison which requires the AI to perform visual analysis. This could be uploading a PNG of some data chart and asking it to interpret the graph or giving it some artistic image and asking the AI to describe, interpret and comment on what it "sees". Or snap a photo of some statuette in your home, upload the image and ask the AI what it is, what it represents, and what it "means".
* Include Claude-2 in your next comparisons
* If you include Claude-2 in the comparisons then, of course, you cannot allow Claude-2 to design the comparison tasks. So, use tasks generated by Pi from Inflection AI found at https://pi.ai/onboarding or by Perplexity where you set that system to use its own native Perplexity LLM (not its options to use an LLM from any of these "competitors" ).
* In general I think your readers would be more interested to see comparison tasks that more closely align with real answers we might be seeking in our own human lives.
Bard once coached me step-by-step through a couple of hours of bringing a dead laptop back to life when it had no operating system and no hard drive to insert any kind of disc. That was impressive!
ChatGPT-4 helped me celebrate a wedding anniversary by coaching us on the finer points of our celebratory Scotch whiskey.
Claude-2 once helped me select the appropriate male deity figure from a list of 12 to begin an art project. (However, Claude-2 did require extra prompting before it could imagine having personal preferences. )
* ChatGPT-4 includes "Custom Instructions" where you can declare: 1. What you want it to remember about yourself across conversations and 2. How you want it to respond in terms of style and substance.
You should leave the second file blank for this test as it would interfere with your head to head comparisons. But there's no harm in adding your standard Bio text to the first file. That would better reveal one of ChatGPT-4's core strengths: Remembering who it is talking to!
* Pi, from Inflection AI, is on the cusp of a major upgrade in December of 2023. If we get that upgrade to the Inflection-2 LLM before you complete your comparisons, then you really should add Pi to the competitors and let Perplexity design the tasks.
Pi is already the most human of all the personal AI systems, with the highest Emotional Intelligence (EI) and the most friendly conversationalist. But this early, beta test Pi using the less powerful Inflection-1 LLM limits Pi to the attention span of a new puppy and it has no capability to upload files or images for analysis. So, it would not be fair to include Pi-1 in a comparison of the AI superstars.
(The upgraded Pi on Inflection-2 will probably blow away all this competition.) -
lack of knowledge
Hawaii is not part of the continental United States. You gave points for an incorrect phrase. -
tirebiter88011
Chrism08873 said: "Neither story was good." which is subjective, not objective. the better test would have been to let Claude decide which is better or a third party who didn't know where the stories were sourced.
Mmm, Claude? No. Maybe Pi. -
RyanMorrison
The reason for comparing the free version of ChatGPT to Bard with Gemini Pro is that Google says it performs on par with GPT-3.5, the model that powers the free version of ChatGPT.
Comparing it to GPT-4 would be unfair. However when Bard Advance launches in the new year I will then compare it to GPT-4.
Vision isn’t available with the free version of ChatGPT.