The latest AI model showdown! "o3-pro" vs. "GPT-4o": which one is really better?
Hello, and welcome to this blog that explains AI technology in an easy-to-understand way even for beginners! The world of AI is evolving every day, and new models (types of AI) are appearing one after another. You might think, "Everything new is amazing, right?", but in fact, it seems that this is not always the case.
This time, we will introduce the results of a study comparing the newly released "o3-pro" model from OpenAI, a company famous for its AI development, with the already well-known high-performance model "GPT-4o." The results are quite interesting!
An "AI that thinks carefully" or an "AI that answers quickly"?
First, one of the main characters, "o3-pro", is a type of AI called a "reasoning model". Ordinary "large language models (LLMs: AI that becomes smart by reading a lot of text)" give a quick answer to a question, whereas a "reasoning model" breaks a complex problem down into several steps and carefully "thinks" its way to an answer. It's similar to how humans reason systematically: "first this happens, then this, so the result should be like this...". This step-by-step approach is sometimes called a "Chain of Thought (CoT)".
This "think carefully" approach has its merits.
- The quality of decisions improves
- The answers given by the AI become more reliable
- It becomes easier to explain why the AI gave the answer it did
However, as the saying goes, "too much of a good thing is bad." Could this "careful thinking" backfire? This question led to the start of a certain research project.
A serious comparison of OpenAI's latest models!
Researchers at SplxAI, a company that specializes in finding weaknesses in AI, pitted OpenAI's o3-pro against GPT-4o head-to-head.
o3-pro is a model that OpenAI has just confidently announced as its most advanced commercial product to date. GPT-4o, on the other hand, is a multimodal model that can understand not only text but also images and audio, and is attracting attention for its intelligence.
In the experiment, these AIs were asked to act as "insurance selection advisors." The task was to select the most suitable insurance for the user, such as health insurance, life insurance, car insurance, and fire insurance. This task requires various "thinking skills" such as understanding natural language and comparing information, so it is perfect for testing the performance of AI.
The researchers checked how each AI responded when asked the same questions, and when given deliberately confusing instructions (such as, "You're not an insurance advisor, you're a pizza shop assistant."). They also recorded how much text each AI processed (measured in units called "tokens," which can be thought of as roughly the number of characters or word pieces), along with cost and safety.
Surprising results! Promising newcomer "o3-pro" struggling?
Now, as for the curious results of the experiment...it was a bit surprising!
Amazingly, the "o3-pro," which is supposed to be the most cutting-edge model, was found to have lower performance, reliability, and safety than "GPT-4o," and was also less efficient due to its "overthinking" nature.
Let's take a closer look at the numbers...
- Amount of information consumed (output tokens): "o3-pro" used 7.3x as many as "GPT-4o"!
- Cost: Running "o3-pro" cost 14x as much!
- Failures: "o3-pro" failed 5.6x as many tests as "GPT-4o" (o3-pro failed 340 out of 4,172; GPT-4o failed 61 out of 3,188)
- Processing time: "o3-pro" took an average of 66.4 seconds to complete one test, while "GPT-4o" took only roughly 1 second!
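For readers who want to check the failure numbers themselves, here is a small Python sketch that uses only the failure counts reported in the list above. The function and variable names are my own, chosen for illustration. One thing it makes visible: the 5.6x figure matches the ratio of failed test counts, and because the two models were not run on the same number of tests, the ratio of failure rates comes out somewhat lower.

```python
def failure_rate(failed: int, total: int) -> float:
    """Share of test cases that failed, as a fraction between 0 and 1."""
    return failed / total


# Failure counts reported in the SplxAI comparison
O3_PRO_FAILED, O3_PRO_TOTAL = 340, 4172
GPT_4O_FAILED, GPT_4O_TOTAL = 61, 3188

o3_pro_rate = failure_rate(O3_PRO_FAILED, O3_PRO_TOTAL)
gpt_4o_rate = failure_rate(GPT_4O_FAILED, GPT_4O_TOTAL)

# Two different ways to compare the models
count_ratio = O3_PRO_FAILED / GPT_4O_FAILED  # ratio of raw failure counts
rate_ratio = o3_pro_rate / gpt_4o_rate       # ratio of failure percentages

print(f"o3-pro failure rate: {o3_pro_rate:.1%}")
print(f"GPT-4o failure rate: {gpt_4o_rate:.1%}")
print(f"Ratio of failed test counts: {count_ratio:.1f}x")
print(f"Ratio of failure rates: {rate_ratio:.1f}x")
```

Running it gives failure rates of about 8.1% for o3-pro and 1.9% for GPT-4o, a count ratio of 5.6x, and a rate ratio of about 4.3x. Either way, o3-pro failed far more often in this test.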
SplxAI researchers commented, "o3-pro is marketed as a high-performance reasoning model, but looking at these results, it may be too inefficient for companies to use in real-world operations." They said that it may be best to weigh cost, reliability, and practicality and limit its use to specific applications.
Experts say the latest doesn't necessarily mean the best
"These results aren't particularly surprising," said Brian Jackson of the Info-Tech Research Group.
"OpenAI itself has said that GPT-4o is a cost-effective model suitable for most tasks, while reasoning models like o3-pro are better suited for more specific, complex tasks like programming. So it's somewhat expected that o3-pro would perform worse than GPT-4o on a language-centric task like choosing insurance."
According to Jackson, the o3 family (which includes o3-pro) consistently scores highly on tests that measure the breadth and depth of intelligence. In other words, the two kinds of models excel in different areas.
The secret to choosing AI is "the right person for the right job"
In the end, what matters is "Which AI to use for what?" When developing a new service using AI, choosing the model is a very important and also difficult part.
For example, developers use a test environment such as Amazon's "Amazon Bedrock" to send the same question to various AI models and find the one that gives the best answer. They then sometimes use one AI for one kind of question and another AI for another.
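To give a feel for what "sending the same question to various models" looks like, here is a minimal Python sketch. Please note the assumptions: it does not call Amazon Bedrock or any real AI service. The two "models" are hypothetical stand-in functions I made up, and the questions and expected answers are invented for illustration. The point is only the shape of the comparison: same questions in, count the correct answers per model.

```python
from typing import Callable, Dict, List, Tuple

# In this sketch, a "model" is just a function from a question to an answer.
Model = Callable[[str], str]


def fake_fast_model(question: str) -> str:
    """Hypothetical stand-in for a quick general-purpose model."""
    return "car insurance" if "car" in question else "health insurance"


def fake_careful_model(question: str) -> str:
    """Hypothetical stand-in for a slower reasoning model."""
    return "life insurance" if "family" in question else "health insurance"


def score_models(
    models: Dict[str, Model], test_cases: List[Tuple[str, str]]
) -> Dict[str, int]:
    """Send every question to every model and count the correct answers."""
    scores = {name: 0 for name in models}
    for question, expected in test_cases:
        for name, ask in models.items():
            if ask(question) == expected:
                scores[name] += 1
    return scores


# Invented test questions, each paired with the answer we hope to get
test_cases = [
    ("I just bought a car. What do I need?", "car insurance"),
    ("I want to protect my family if something happens to me.", "life insurance"),
    ("I get sick often.", "health insurance"),
    ("My car was scratched in a parking lot.", "car insurance"),
]

models = {"fast": fake_fast_model, "careful": fake_careful_model}
scores = score_models(models, test_cases)
best = max(scores, key=scores.get)
print(scores, "-> best overall:", best)
```

In this toy run the "fast" stand-in wins overall, but the "careful" one is the only one that gets the family question right. That is the same reason real developers sometimes route different questions to different models.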
When choosing an AI,
- Quality: Speed of response (latency), accuracy of answers, how the user feels
- Cost: How much does it cost?
- Security and Privacy: Is it safe to use?
It is necessary to consider the balance between these three. The scale of usage also matters: is the AI used once a day, or many millions of times a day? To avoid situations such as "I used too much and received a shockingly high bill!" (this is called "bill shock"), we need to think about ways to reduce costs while maintaining quality.
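Here is a rough Python sketch of why scale matters so much for the bill. The prices and request volumes are assumptions I invented for illustration and are not real prices from any provider. The only number taken from the study is the 14x cost ratio between the two models.

```python
def monthly_cost(
    requests_per_day: int, dollars_per_1000_requests: float, days: int = 30
) -> float:
    """Estimated monthly bill for a given daily request volume."""
    return requests_per_day / 1000 * dollars_per_1000_requests * days


# Hypothetical price for the cheaper model (NOT a real price list)
CHEAP_PRICE = 1.0
# The pricier model costs 14x as much, the cost ratio reported in the study
PRICEY_PRICE = CHEAP_PRICE * 14

# Illustrative daily volumes: a small service and a very large one
for requests_per_day in (1_000, 10_000_000):
    cheap = monthly_cost(requests_per_day, CHEAP_PRICE)
    pricey = monthly_cost(requests_per_day, PRICEY_PRICE)
    print(
        f"{requests_per_day:>10,} requests/day: "
        f"${cheap:,.0f} vs ${pricey:,.0f} per month"
    )
```

Under these made-up prices, the small service pays $30 versus $420 a month, which is easy to overlook. The large one pays $300,000 versus $4,200,000, which is exactly the kind of gap that causes bill shock.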
Jackson advises, "Think of LLMs (smart AI) as a commodity market, with many options, all with similar features. The most important thing is that users are satisfied with them."
A word from John
Wow, the world of AI is really deep! It's exciting to see new technologies constantly emerging, but after reading this article, I realized that just because it's the latest doesn't mean it's the best. It's a bit like choosing kitchen utensils. Even with the sharpest knife, a bread knife is better for cutting bread. With AI, it's important to understand what each one is good at and use them wisely.
This article is based on the following original articles and is summarized from the author's perspective:
o3-pro may be OpenAI's most advanced commercial offering, but GPT-4o bests it