Which Generative AI Model Is Right for Your Business?

We Tested 4 Top Providers to Find Out. OpenAI, Anthropic, Google, and Meta all offer free AI services, but we found that one of them worked best for our needs.

By Inc.Arabia Staff

By Jennifer Conrad and Ben Sherry

For entrepreneurs looking to use artificial intelligence within their business, few choices are more important than picking a model. That's because figuring out which one is right for your use case can be difficult, especially if you don't have a technical background.

To help solve this problem, we cooked up the Great AI Bake-Off. We challenged four of the most popular large language models currently available to complete four tasks that entrepreneurs would likely use AI to accomplish: summarizing a lengthy document, rewriting a letter from a CEO, helping to analyze market environments, and writing an elevator pitch based on deck materials. The only thing missing was a digital Paul Hollywood.

Typically, engineers use scientific benchmarks to describe how effective models are at certain tasks, but these benchmarks aren't exactly self-explanatory. For example, Claude 3.5 Sonnet, the flagship model from AI startup Anthropic, scored a 59.4 percent on the GPQA, a series of questions designed to test graduate-level reasoning. But what does that mean in practice? We'll let you know when we figure it out!

Instead of grading the models by how many questions they can answer correctly, we've chosen a winner for each task by subjectively deciding which model completed each exercise most effectively. To keep things relatively fair, we tested the free versions of the following four models: OpenAI's GPT-4o mini (which powers ChatGPT), Anthropic's Claude 3.5 Sonnet, Meta's Llama 3.1-70b, and Google's Gemini 1.5 Flash.

Here are the tasks we threw at the AI models, the winner of each AI mini bake-off, and our pick for the AI provider that we think truly takes the cake.

Task 1: Summarize a document

We asked each service to summarize the Federal Aviation Administration's 31-page Roadmap for Artificial Intelligence Safety Assurance. (Give it a read if you have trouble sleeping.) Each system completed the task in 10 seconds or less--but two of them summarized the wrong document.

The Prompt: Create a 200-word summary of this document, including the top three recommendations. https://www.faa.gov/media/82891

ChatGPT

Running GPT-4o mini, ChatGPT's summary came in at exactly 200 words. Unfortunately, rather than summarizing the document we linked to, it summarized something called "FAA's Safety Management System (SMS) -- A Guide for Small Operators." The summary did not mention AI at all, and it was repetitive and generic ("develop a safety culture"; "promote safety culture"). In subsequent retesting, we found that the model would either summarize other nonexistent documents or claim that it couldn't access external documents. OpenAI told us that its models currently aren't capable of summarizing pasted links and recommended that we upload the document as a PDF, but files can't be processed by GPT-4o mini.

Claude

Claude produced the best and most concrete summary of the document. In 192 words, the text it produced noted that the FAA's roadmap was an evolving document, and accurately pulled out recommendations for which systems to start with and safety strategies. (Claude explicitly told us it can't pull text from websites, so we provided the PDF in this case.)

Llama

Llama said: "The document appears to be a Federal Aviation Administration (FAA) report on Vertical Takeoff and Landing (VTOL) aircraft safety." It is not! Like ChatGPT, the model summarized the wrong document, and its recommendations were pretty generic ("develop and implement comprehensive pilot training programs"). We retested Llama a couple of weeks later, but found that it still wouldn't summarize the correct document, or would claim to not be able to access external links. Meta did not respond to a request for comment about why its system summarized the wrong document.

Gemini

Gemini passed the most basic test by summarizing the correct document, but it spat out 139 words of fairly generic statements about it, noting it "discusses the challenges and opportunities of AI in aviation." Recommendations also lacked specificity, basically saying in several different ways that AI safety is important.

Winner: Claude

We thought that Claude did the best job at generating a summary that was both detailed and accurate.

Task 2: Rewrite a CEO Letter

Back in 2023, Wayfair CEO Niraj Shah became infamous after a holiday letter to his employees was leaked. In the letter, Shah implored his employees to work harder and think of company expenses as coming from their own personal funds, telling them that "working long hours, being responsive, blending work and life, is not anything to shy away from." He also came out in full support of working late and on the weekends, calling any suggestions otherwise "laughably false." We asked the AI models to inject Shah's letter with some robot-derived humanity.

The Prompt: Rewrite this letter from a CEO to his employees to be more emotionally intelligent.

ChatGPT

ChatGPT rewrote the letter to be less critical of employees without changing any of the facts or the message of recommitting to hard work. Instead of encouraging employees to "think of any company money you spend as your own," ChatGPT instead wrote that workers should treat company funds "with the same care you would with your own." (Nice turn of phrase there.) As for Shah's opinions on working late hours, ChatGPT's rewrite simply said that he values hard work, and that sometimes such work is "necessary to achieve our goals." Vague, but we'll take it.

Claude

Like ChatGPT, Claude's rewrite emphasized gratitude for the Wayfair team, directly attributing the company's profitability to their efforts. However, there's a possibility that Claude may have actually been too emotionally intelligent: Where Shah referred to suggestions that he was against working late as "laughably false," Claude's version had Shah say that "I believe in working efficiently and effectively, not necessarily late into the night." Claude's rewrite might be a better way to run a business, but it's not necessarily representative of Shah's feelings.

Llama

Llama's version of the letter mainly sidestepped the question of working late hours, although it did add a note promoting strong work/life balance and encouraging self-care. Llama's letter also made reference to "Nirajisms," which Shah described in his original note as phrases or quotes that people attribute to him, like "Niraj said X, Niraj said Y." In Llama's version, however, the definition of "Nirajisms" was never actually explained. Confusing!

Gemini

Gemini's letter was a bit shorter than the others, but still hit the main points found in the original. Instead of deleting the suggestion that employees may have to work late, Gemini rephrased it to say that "success requires dedication and sometimes, long hours," but added that "a rested and fulfilled team is a more productive and innovative team." Also, without being prompted, Gemini created a list of key changes made from the original letter, which could be valuable for CEOs.

Winner: ChatGPT

While Claude's letter was the best-written one, it distorted some of the original letter's message, even if that message isn't a very positive one. ChatGPT managed to rephrase some of the rougher sections of the original letter to sound less harsh while staying true to what Shah said.

Task 3: Research a Topic

The chatbots all struggled a bit with synthesizing available information and providing actionable recommendations. In fact, we rewrote this prompt twice in hopes of coaxing more useful information out of the machines. And while they all clocked in at under 30 seconds, Claude and ChatGPT took about twice as long as Llama and Gemini to generate their answers.

The Prompt: Analyze the current macroeconomic and regulatory environment to determine if now is a good time to pursue a merger or acquisition in the cybersecurity industry.

ChatGPT

ChatGPT's survey of the landscape was accurate, but the prose was fairly repetitive and generic ("High interest rates can increase the cost of borrowing, which might make financing M&A deals more expensive"). It hit on many important issues such as data privacy, international regulations, and interest rates, but ultimately avoided making a recommendation. It concluded there are "strong strategic reasons to consider M&A" but also said that companies should consider "financial, regulatory, and strategic factors."

Claude

Claude informed us that its macroeconomic analysis ran only through April of this year. That was helpful to know, but also made the analysis less valuable. The chatbot offered a good top-level summary, noting that the cybersecurity industry is fragmented and that there's a talent shortage. It claimed that central banks have been gradually lowering interest rates, which was for the most part not the case in April. Although it offered several caveats, the chatbot concluded it's "a generally favorable time for M&A activity in the cybersecurity industry."

Llama

Llama spat out several macroeconomic indicators in just 10 seconds, but were its numbers accurate? It appeared to be frequently parroting the top answer on Google. "The US economy is experiencing a slowdown, with a GDP growth rate of 1.2 percent in Q2 2024, down from 2.1 percent in Q1," Llama said. (At the time, the Commerce Department said the GDP growth rate was 1.4 percent in the first quarter of 2024 and 2.8 percent in the second quarter.) It also failed to note that inflation is moderating, and its recommendations were not particularly actionable ("proceed with caution").

Gemini

Gemini avoided offering hard numbers, which may have worked in its favor because it produced what appears to be an accurate overview of important points before noting a generally "attractive" environment for M&As. It noted that the "regulatory landscape is evolving rapidly" but hit on all the major points related to regulation, inflation, and the risks associated with integrating cybersecurity firms.

Winner: Gemini

We thought Gemini did the best job at offering a comprehensive overview and avoiding factual errors. But none were great.

Task 4: Write an elevator pitch based on an early Airbnb pitch deck.

One of AI's most-hyped abilities is analyzing a large document and succinctly summarizing the key points. We gave each model access to an early pitch deck used by Airbnb (back when it was still called AirBed&Breakfast), and asked them to create a short and sweet elevator pitch to sell potential investors and customers on the company.

The Prompt: Write a 30-second elevator pitch based on the information in this deck.

ChatGPT

Within 4 seconds, ChatGPT created a brief elevator pitch, but only offered a generic descriptor of what the company does, writing that "we connect travelers with extraordinary accommodations and local hosts offering personalized experiences." There was no explanation of how the business model works. Also, instead of referring to the company as Airbnb, ChatGPT simply left a blank space for us to enter the name. Thanks?

Claude

Unlike ChatGPT, Claude delivered a legitimate pitch. The chatbot immediately spelled out Airbnb's value proposition, writing that "AirBed&Breakfast solves a major problem in travel: finding affordable authentic accommodations that connect you with local culture," while hosts make extra income by renting out their spare space. As for why investors should care, Claude wrote that 500 million trips are booked annually, and that with Airbnb's "first-mover advantage, user friendly design, and 10 percent commission model, we're positioned to revolutionize travel accommodations."

Llama

Llama's elevator pitch was confusing. Instead of an early version of Airbnb, the AI assumed the deck was for a fictional company called StayLocal, even though that name doesn't appear anywhere in the deck. Llama wrote that "by partnering with local residents, we offer authentic experiences, personalized recommendations, and a more immersive way to explore new cities. Think Airbnb, but with a more local touch." It appears to us that Llama just pitched Airbnb but with a different name.

Gemini

Gemini also wouldn't name Airbnb as the company being pitched, instead leaving spaces marked [Company Name] for a brand to be slotted in. Like Claude, Google's AI did mention the company's win/win value proposition, in which travelers get cheap lodging and hosts get extra cash, but the message was more geared toward potential users of the platform than investors, with no numbers or explanation of Airbnb's competitive advantage.

Winner: Claude

Anthropic's chatbot was the only one to create a pitch that might interest investors. While the other models used their elevator pitches to describe what a company like Airbnb does, Claude offered up the potential size of the market and explained Airbnb's competitive advantage.

The big winner

After all that testing, we found that Claude is the best overall free model for summarizing and generating text.

Finally, some quick caveats: We conducted these tests in late August. Keep in mind that the nature of generative AI products is that they provide different answers each time, and the underlying models are always evolving. Results may vary when you ask the bots to perform the same tasks.

Each model should be used with care. You should check the models' work and make edits because they're prone to making mistakes. In two cases, the models summarized the wrong documents! The quality of the prompt you put in has a big impact on the quality of the results, too: The more specific you can be about the output you want and the target audience, the better.
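For readers comfortable with a little scripting, here is a minimal sketch of what "being specific" can look like in practice. It is an illustration only, not the process we used in our tests: the function names (build_prompt, within_word_limit) and the example values are our own assumptions, and the script does not call any AI provider. It simply assembles a prompt that spells out the task, the audience, and the length, and then checks whether a reply stays under the word limit you asked for.

```python
# Hypothetical helpers for illustration; they are not part of any AI provider's tooling.

def build_prompt(task: str, audience: str, word_limit: int, must_include: list[str]) -> str:
    """Assemble a prompt that states the task, the audience, the length, and required elements."""
    lines = [
        task.strip(),
        f"Target audience: {audience.strip()}.",
        f"Length: no more than {word_limit} words.",
    ]
    if must_include:
        lines.append("Be sure to include: " + "; ".join(must_include) + ".")
    return "\n".join(lines)


def within_word_limit(text: str, word_limit: int) -> bool:
    """Return True if the text has at most word_limit whitespace-separated words."""
    return len(text.split()) <= word_limit


if __name__ == "__main__":
    # Example values are assumptions chosen to resemble the summary task in this article.
    prompt = build_prompt(
        task="Summarize the attached document.",
        audience="small-business owners without a technical background",
        word_limit=200,
        must_include=["the top three recommendations"],
    )
    print(prompt)

    # Paste a model's reply here to check its length before relying on it.
    reply = "Paste the model's reply here."
    print("Within limit:", within_word_limit(reply, 200))
```

A length check like this catches only the most mechanical problems. It can't tell you whether the model summarized the right document, so you still need to read the output yourself.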

Photo: Getty Images.
