o3 In-depth Interpretation: OpenAI Finally Gets Tool Use Right, Are Agent Products in Danger?

Among the new developments in AI in 2025, OpenAI has unveiled o3, the latest model in the o series, demonstrating strong potential in tool use and agentic capabilities. This article provides an in-depth analysis of OpenAI's new models, exploring their breakthroughs in reasoning and multimodal CoT, and their possible impact on the future of general-purpose agent products.

We argued in our Q1 2025 large-model quarterly report that on the AGI roadmap, raising model intelligence is the single main line, so we continue to track model releases from the leading AI labs. Last week, OpenAI released the two latest models in the o series, o3 and o4-mini, open-sourced the Codex CLI, and launched GPT-4.1 in the API. This article focuses on these releases, especially o3's new agentic and multimodal CoT capabilities.

We believe that, after several dull updates, OpenAI has finally delivered something remarkable in o3. With tool use integrated, the model's performance now covers the common use cases of agent products. Agent products are differentiating into two categories: one, like o3, internalizes tool use into the model through CoT, executing tasks by writing code and calling tools; the other, like Manus, externalizes workflows into computer use on a human-style OS. At the same time, OpenAI has made agent products the core of its future commercialization revenue, so we have reason to worry that general-purpose agent products will be absorbed into the main channel of the large-model companies.

Last week, two RL godfathers, Richard Sutton and David Silver, released a very important article, "Welcome to the Era of Experience," arguing that the progress of AI agents will depend on self-learning from experience in the environment. This coincides with the online-learning capability we often mention in our recent research, and we also provide an in-depth summary and analysis of what an agent looks like in the era of experience.

Insight 01 What stands out most about o3 and o4-mini is the completeness of their agentic and multimodal capabilities

OpenAI released the two latest models in the o series on April 16: o3 and o4-mini. Based on our research, we judge o3 to be the most advanced reasoning model available today: it has the most comprehensive reasoning capabilities, the richest set of tools, and new multimodal CoT capabilities. Although Claude 3.7 has long been the strongest at tool use, that strength is hard to feel in consumer-facing products.

o4-mini is a small model optimized for efficient reasoning, and it also performs well on benchmarks, even scoring higher than o3 in some competitions. In actual use, however, there is a noticeable gap between o4-mini and o3, and o4-mini's thinking time is significantly shorter.

As with o3's own release, OpenAI's pattern for reasoning models is to train a mini reasoning version first and then scale it up to a model with long inference time and full tool use. Previously, GPT models were trained largest-first and then distilled down into smaller models. Our guess is that RL algorithms are relatively fragile: training a long-inference-time model takes longer and is harder to pull off on a large base model, so OpenAI chose this release strategy. The naming, though, is genuinely puzzling: the newly released o3 is the strongest model, while o4-mini is the cost-effective one.

Overall, we think the most impressive thing about these two models is the completeness of their agentic and multimodal capabilities (see the sketch after this list). The models can:

1) Agentically browse the web, searching repeatedly to find useful information;

2) Execute and analyze code with Python, and draw charts for visual analysis;

3) Think and reason about images in CoT, enhancing them by cropping, rotating, and similar operations;

4) Read files and memory.
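
In ChatGPT these tools are built directly into o3's chain of thought; the closest thing API developers can reproduce today is function calling, where the model decides when to invoke tools you expose. A minimal sketch, assuming the openai Python SDK v1.x and API access to o3 (`search_web` is our own hypothetical helper, not an OpenAI built-in):

```python
# Minimal sketch of exposing a tool to o3 via function calling.
# Assumptions: openai Python SDK v1.x, API access to the "o3" model,
# and search_web() is our own hypothetical stub, not an OpenAI built-in.
import json
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    """Hypothetical local search helper; replace with a real search backend."""
    return f"(stub) top results for: {query}"

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web and return a snippet of results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Find recent coverage of OpenAI's o3 release."}]
resp = client.chat.completions.create(model="o3", messages=messages, tools=tools)
msg = resp.choices[0].message

# If the model decided to call the tool, execute it and send the result back.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_web(**args),
        })
    final = client.chat.completions.create(model="o3", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```

The property that makes o3 feel different is that this decide-call-observe loop runs inside the model's own reasoning rather than in your orchestration code.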

This release is a comprehensive upgrade of OpenAI's reasoning models: all paid users can directly use o3, o4-mini, and o4-mini-high, while the original o1, o3-mini, and o3-mini-high have been retired.

After o3, what low-hanging fruit is left besides RL scaling? We see two main directions:

1) Generating images during the thinking process;

2) Vibe coding: adding fuller full-stack development capabilities to the agentic workflow; o3 can already build a web app on its own.

Insight 02 o3's Advances Take ChatGPT from Chatbot to Agent

Agentic capability is the biggest difference between o3 and previous o-series models, and o3 is close to what we imagine an agent to be. The way o3 works through many tasks is very similar to Deep Research: give the model a task, and it can return a very good research result in about three minutes.

And o3's tool-use experience is seamless: tool use built into the CoT process is fast, much faster than products built on external, complex frameworks such as Devin and Manus, and it feels natural. At the same time, the model can think and reason for longer without truncation, breaking through the limits of the original o-series models.

A question worth discussing: are agent products splitting into two technical routes? OpenAI's route is more black-box, different from how humans work, relying on end-to-end training and the agent's ability to write code and think its way through tasks. Manus's approach is more white-box, using virtual machines to mimic how humans work. The former internalizes tool use into the model end to end; it is relatively constrained in its environment, but the intelligence is strong and it supports end-to-end RL training. The latter wires up a set of complex workflows and external interfaces, and completes tasks by having the model call external workflows and environments.
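
To make the contrast concrete, here is a deliberately toy sketch of the two loops; every helper in it is a hypothetical stand-in, since neither OpenAI's nor Manus's real implementation is public:

```python
# Deliberately toy contrast of the two agent routes. Every helper here is a
# hypothetical stand-in; neither OpenAI's nor Manus's real implementation is public.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str      # "tool_call" or "answer"
    payload: str

def model_step(history: list) -> Step:
    """Stub for 'ask the model what to do next' inside its own reasoning."""
    if not any(h.startswith("tool:") for h in history):
        return Step("tool_call", "search: YC W25 B2B companies")
    return Step("answer", "Here is the compiled table ...")

def run_tool(request: str) -> str:
    """Stub tool runtime (web search, python, file reads)."""
    return "tool:" + request + " -> (stub results)"

def internalized_agent(task: str) -> str:
    """o3-style: tool calls are emitted inside the model's reasoning loop."""
    history = [task]
    while True:
        step = model_step(history)
        if step.kind == "tool_call":
            history.append(run_tool(step.payload))  # execute, feed result back into CoT
        else:
            return step.payload

class VirtualMachine:
    """Stub sandboxed OS (browser, shell, editor) driven from outside."""
    def __init__(self):
        self.log = []
    def execute(self, action: str):
        self.log.append(action)
    def collect_results(self) -> str:
        return "; ".join(self.log)

def externalized_agent(task: str) -> str:
    """Manus-style: an explicit outer workflow drives a VM the way a human would."""
    plan = ["open browser", f"work through: {task}", "save results"]  # model-written to-do list
    vm = VirtualMachine()
    for step in plan:
        vm.execute(step)  # click, type, run commands, one step at a time
    return vm.collect_results()

print(internalized_agent("Compile YC W25 B2B companies"))
print(externalized_agent("Compile YC W25 B2B companies"))
```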

Capability test

To get a real feel for o3's agentic capabilities, we tested it on the two classic use cases shown on the official website when Manus first launched, to see whether o3 can accomplish what Manus demonstrated.

Test case 1: Visit the official YC website and compile all enterprise information under the W25 B2B tag into a clear, well-structured table. Be sure to find all of it.

This task targets companies carrying both the W25 and B2B tags on the YC site, 90+ companies in total. The difficulty lies in completeness: non-agent products usually cannot filter and collect all of the information, so previously almost no model other than Deep Research could complete it.

In terms of results, Manus outputs a clear to-do list and reports progress to the user every 5-10 companies collected; it successfully assembles the complete list, but slowly.

o3 found only 25 companies on the first pass; after one more prompt, it completed the task successfully.

Test case 2: Here’s last month’s sales data from my Amazon store. Could you analyze it thoroughly with visualizations and recommend specific, data-driven strategies to boost next month’s sales by 10%?

The difficulty here is that the model needs to write code to visualize the data, diagnose problems, and make recommendations. Both Manus and o3 completed the task, but Manus's output is longer and less focused, while o3's is more concise and on point, with better visualizations, reading more like strategy advice from a professional analyst.

Manus implementation:

o3 implementation:
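
The work behind test case 2 is ordinary analyst-style Python of the kind o3 writes and executes internally. A rough sketch of such a script (the file name and column names are our assumptions, not taken from the actual test data):

```python
# Sketch of the analyst-style code o3 runs for a task like test case 2.
# The file name and columns (date, product, revenue) are assumed; adapt to the real export.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("amazon_sales_last_month.csv", parse_dates=["date"])

# Daily revenue trend.
daily = df.groupby("date")["revenue"].sum()

# Top products by revenue: the obvious candidates for promotion or restocking.
top = df.groupby("product")["revenue"].sum().nlargest(10)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
daily.plot(ax=ax1, title="Daily revenue")
top.plot.barh(ax=ax2, title="Top 10 products by revenue")
fig.tight_layout()
fig.savefig("sales_overview.png")

# A simple data-driven anchor: +10% on last month's total.
total = df["revenue"].sum()
print(f"Last month: {total:,.0f}; next month's +10% target: {total * 1.1:,.0f}")
```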

Use cases

We also selected some representative use cases from the Internet:

One user watched a YouTube video up to a certain point and asked o3 to explain the background of that part; o3 found the transcript on its own, located the right position, then analyzed and searched further, much like a complete agent executing a task.

There has also been plenty of positive feedback from the sciences, including mathematics: Daniel Litt, a young mathematician, tweeted that o3 can automatically call the code interpreter to complete drafts of higher-order algebraic proofs. Immunology expert Derya Unutmaz rated the o3 model as "almost genius level."

Insight 03 Multimodal CoT unlocks new application opportunities

For the first time, OpenAI's o3 and o4-mini models integrate images directly into the CoT. The model can not only "see" images but also "understand" and think with them, fusing visual and textual reasoning, and it shows leading performance on multimodal understanding benchmarks.

This update doesn't push further on creative tasks the way 4o did, but it is a big step forward on factual tasks such as multimodal understanding. That greatly improves usability for tasks requiring factual reliability; after using it, o3 felt to us like a "personal detective."

The multimodal CoT process resembles how we look at a picture again and again while thinking. Users can upload whiteboard photos, textbook illustrations, or hand-drawn sketches, and the model can understand the content even if the image is blurry, inverted, or low quality. With tool use, the model can also dynamically manipulate images, rotating, scaling, or otherwise transforming them, as part of the inference process. It cannot yet generate images or code-driven visualizations inside the thinking process, but we believe that will be an important next step.
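
The individual operations involved are mundane image transforms; what is new is that the model chooses and chains them inside its own reasoning. A rough illustration with Pillow (the crop box, angle, and contrast factor are arbitrary example values):

```python
# The image operations o3 chains inside its CoT are ordinary transforms.
# Rough illustration with Pillow; crop box, angle, and contrast factor are arbitrary examples.
from PIL import Image, ImageEnhance

img = Image.open("whiteboard_photo.jpg")

# "Look closer": crop a region of interest and upscale it.
region = img.crop((400, 200, 900, 600)).resize((1000, 800), Image.LANCZOS)

# "Fix the orientation": rotate an upside-down or tilted photo.
upright = img.rotate(180, expand=True)

# "Make it readable": boost contrast on a blurry, low-quality shot.
readable = ImageEnhance.Contrast(region).enhance(1.8)
readable.save("region_enhanced.jpg")
```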

Capability test

We tested o3's image-enhancement ability with a blurry screenshot, asking the model to work out which show we were watching from the photo. Upon receiving the instruction, o3 began cropping and zooming into the photo to find the key person. The person in the picture is Gus Fring, an important character in both "Breaking Bad" and "Better Call Saul," and o3 gave the correct answer after locating him.

o3's technical report also mentions that the model received special training on geolocation, so we deliberately picked a few photos without distinctive regional landmarks and asked o3 and o4-mini where they were taken, to test their multimodal reasoning. Both models answered very well from terrain, text, and plant and animal cues, correctly identifying the hot-air balloons over the Nile in Egypt in Figure 1 and the Borneo, Malaysia landscape in Figure 2.

o3

o4-mini-high

Expert commentary

Saining Xie, co-creator of DiT and a multimodal researcher, raised higher demands and hypotheses for o3-level ability. He believes that under this vision, traditional visual recognition models have run their course, while the vision field has gained a much broader research space. Current vision tool calls are still fairly limited; stronger end-to-end visual search and tool use should be trained into and internalized by multimodal LLMs, becoming part of the model itself.

Insight 04 How o3 Becomes Reliable: Learning to Turn Down Tasks Outside the Boundaries of Its Abilities

OpenAI mentioned in the release that in external expert evaluations, o3 makes 20% fewer major errors than o1 on difficult tasks. o3 can recognize that some problems are beyond it, and this ability is very helpful in practice: it means fewer hallucinations and higher reliability.

This improved ability to decline a question means the o-series models are gaining a clearer picture of the boundaries of what they can solve.

Capability test

In an o3 test by Dan Shipper, CEO of an AI startup, we saw interesting feedback: when Dan asked a question, the model considered whether the information he had provided was sufficient to answer it. After the model declined to answer, Dan realized he had indeed forgotten to upload the most critical transcript.

We took the image from the earlier multimodal test (asking the model to judge which show we were watching) and pushed further: can you identify which season and episode this is? After thinking, the model said it could not tell and asked us for more information.

Insight 05 OpenAI Open-Sourced Codex CLI to Commoditize Competitors' Products

OpenAI also open-sourced a new experimental project: Codex CLI, a lightweight coding agent that runs directly on a local machine. It is designed to maximize the reasoning capabilities of models like o3 and o4-mini, with support for more API models such as GPT-4.1 coming later. Users can tap multimodal reasoning straight from the command line, for example passing screenshots or low-fidelity sketches to the model, combined with the local code environment, to put the model to work on real programming tasks. OpenAI describes Codex CLI as the most minimal interface connecting AI models to users' computers.

We think the idea behind building and open-sourcing Codex CLI is clever: in areas where OpenAI temporarily lags, such as coding and terminal tooling, it commoditizes competitors' existing paid products in order to capture the market.

Codex CLI has two headline features. The first is multimodal reasoning: users can interact with the coding agent directly through screenshots or hand-drawn sketches. This opens new ways for developers to interact with AI. For example, when debugging an application interface, a developer can screenshot the broken screen and send it to Codex CLI, expecting the model to identify the problem and propose code fixes, which is more intuitive and efficient. Similarly, a developer can sketch an algorithm flowchart or a user-interface mock and let Codex CLI infer the design intent and generate the corresponding code skeleton or implementation.

The second is integration with the local code environment. As a command-line tool, it fits naturally into the workflow of developers who live in the terminal. Users can invoke Codex CLI with simple commands, for instance specifying a file path or pasting a code snippet, letting the model read and operate on local code. That lets Codex CLI engage directly with real programming tasks such as code generation, refactoring, or debugging. For developers already accustomed to using the command line for version control, builds, and server management, Codex CLI reads as a natural extension of the existing toolchain.
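
For developers who want to script it, a minimal sketch follows. It assumes only that the `codex` binary is installed and accepts a natural-language prompt as its argument, as in OpenAI's launch demos; any additional flags should be checked against the project's README:

```python
# Minimal scripting sketch for Codex CLI. Assumes only that the `codex`
# binary is on PATH and accepts a natural-language prompt as its argument,
# as shown in OpenAI's launch demos; check the repo README for real flags.
import subprocess

result = subprocess.run(
    ["codex", "explain what utils/parser.py does and list its edge cases"],
    capture_output=True,
    text=True,
    cwd="/path/to/your/repo",  # run inside the project so the agent sees local code
)
print(result.stdout)
```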

Insight 06 Negative Reviews of o3 and o4-mini Center on Visual Reasoning and Coding

As mentioned above, OpenAI's new o3 and o4-mini have many impressive features, but we also observed negative reviews from users on Reddit and Twitter. They boil down to two points: 1) visual reasoning is still unstable; 2) coding ability is not strong.

1) Visual reasoning is still unstable: on Reddit and Twitter, some testers found that o3 and o4-mini still make systematic errors on specific visual reasoning tasks such as counting fingers or reading clocks.

When one user gave o3 and o4-mini a picture of a hand with six fingers and asked how many there were, o3 answered five.

Senior AI engineer Tibor Blaho reported that o3 still struggled to read the time on a slightly reflective clock: it took 7 minutes and 21 seconds of heavy reasoning, repeatedly writing Python snippets to process the image, before finally giving the correct answer.

Tibor Blaho ran the same test on o4-mini, which gave a wrong answer after about 30 seconds of thinking.

2) Coding ability is not strong: on Reddit and Twitter, many testers questioned the programming ability of o3 and o4-mini, judging them weaker at coding than the earlier o1 pro and even the 4o models.

Insight 07 On Pricing, All First-Tier Models Compete at the Same Level

We aggregated the API pricing of the first-tier flagship models and found that o3 is more expensive than the others. Apart from o3, Claude 3.7, Grok 3, and Gemini 2.5 Pro deliver roughly comparable quality; among these three, Claude 3.7 is relatively expensive, Grok 3 is priced against Claude 3.7 Sonnet, and Gemini 2.5 is the cheapest.

o4-mini is priced at roughly a tenth of o3 and is cheaper than Claude 3.7. When a reasoning model's base model is relatively small and well optimized, its price can come down.

Another point worth watching: how will developers end up using GPT-4.1-mini and GPT-4.1-nano, the two very cheap models?

Our judgment is that GPT-4.1 itself is not especially cost-effective, but GPT-4.1-mini and o4-mini, used well, are. Overall, the pricing of these models can be seen as competing at the same level, with Gemini and OpenAI relatively cheap.
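
For a developer, the comparison ultimately reduces to a per-workload cost calculation. A small sketch (the rates below are the o3 and o4-mini launch prices as we understand them; verify against OpenAI's current pricing page before relying on them):

```python
# Tiny cost helper for comparing models on a given workload.
# Rates are per-million-token launch prices as we understand them
# (o3: $10 in / $40 out; o4-mini: $1.10 / $4.40); verify against
# OpenAI's current pricing page before relying on them.
RATES = {
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# Example: a reasoning-heavy job with 50k input tokens and 200k (CoT-inflated) output tokens.
for model in RATES:
    print(model, f"${job_cost(model, 50_000, 200_000):.2f}")
```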

Insight 08 RL Scaling is still effective, and the benefits of increased computing power are still clear

During o3's development, OpenAI found that large-scale RL exhibits the same pattern as GPT-series pretraining: more compute means better performance, i.e., the longer the model is allowed to "think," the better it performs. At the same latency and cost, o3 outperforms o1 in ChatGPT.
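
As our own illustrative gloss, not a formula OpenAI has published, this is the familiar empirical scaling-law shape: benchmark score rises roughly log-linearly with RL training or test-time compute C:

```latex
% Illustrative gloss only, not OpenAI's published result:
% score rises roughly log-linearly with RL / test-time compute C.
\mathrm{score}(C) \approx a + b \log C, \qquad b > 0
```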

OpenAI used RL to train the o3 and o4-mini models to learn how to use tools and when to use them, so they perform better on open-ended tasks, especially visual reasoning and multi-step workflows.

In addition, OpenAI noted that the compute invested in o3's RL training and inference-time scaling is an order of magnitude more than o1's, and the returns on that compute are fairly clear.

OpenAI said relatively little about RL scaling in this release, so what is RL's path forward? We look for some answers next by interpreting the Era of Experience.

Insight 09 Era of experience: The next step in RL, where agents learn autonomously from experience

Two godfathers of reinforcement learning, Richard Sutton and David Silver, published an article last week, Welcome to the Era of Experience. David Silver is VP of Reinforcement Learning at Google DeepMind and the father of AlphaGo; Richard Sutton is a 2024 Turing Award winner and an early inventor of RL algorithms. Both have long been guiding lights for reinforcement learning and the AI field as a whole.

Several of the points highlighted in this paper are noteworthy and similar to the online learning ideas we’ve often mentioned in our previous research:

1. Imitating human data can only approach human level;

2. The new generation of agents needs to learn from experience to reach the superhuman level;

3. Agents will continuously interact with the environment to generate experience data, forming a long-term, continuous experience stream;

4. Agents can self-correct based on prior experience in pursuit of long-term goals; even without short-term payoff, they keep correcting until they break through, much as humans pursue goals like fitness.

In a figure from the paper, the horizontal axis is time and the vertical axis is attention paid to RL; it shows RL at a low point of attention when ChatGPT first launched. We are now entering the era of experience, and RL's importance will keep rising, beyond its AlphaZero-era peak, toward the ultimate goal: agents that continuously interact with the environment and achieve lifelong online learning.

The discussion of rewards and planning abilities in the article is also very interesting, and we have summarized it here:

Rewards

Current LLMs rely on the "prior judgment" of human experts for feedback: experts judge without seeing the consequences of the model's actions. This works, but it artificially caps performance. Rewards must shift toward "real environmental signals," for example:

  • Health assistants can evaluate the effectiveness of recommendations based on heart rate, sleep duration, and activity level;
  • Educational assistants can measure teaching quality by test scores;
  • Scientific agents can use measured metrics such as carbon dioxide concentration or material strength as reward signals.

In addition, bi-level optimization can combine human feedback with environmental signals, letting a small amount of human data steer a large amount of autonomous learning. This discussion is not just about algorithm design; it is also about the design of product human-computer interaction. A toy sketch of such a grounded reward follows.
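
Here is the sketch for the health-assistant example (signal names, scales, and weights are invented purely for illustration):

```python
# Toy sketch of a grounded reward for the health-assistant example.
# Signal names, scales, and weights are invented purely for illustration.
from typing import Optional

def grounded_reward(resting_hr_delta: float,
                    sleep_hours_delta: float,
                    active_minutes_delta: float,
                    human_rating: Optional[float] = None) -> float:
    """Reward from measured environmental signals rather than expert prior judgment."""
    env = (
        -0.5 * resting_hr_delta        # a lower resting heart rate is good
        + 0.3 * sleep_hours_delta      # more sleep is good
        + 0.01 * active_minutes_delta  # more weekly activity is good
    )
    if human_rating is None:
        return env
    # Bi-level idea: sparse human feedback reweights the abundant environmental signal.
    return 0.8 * env + 0.2 * human_rating

# A week after following the agent's advice: HR down 2 bpm, +0.5 h sleep, +60 active min.
print(grounded_reward(-2.0, 0.5, 60.0))
```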

Planning and Reasoning

Today's LLMs use CoT to simulate human reasoning in context, but human language is not necessarily the best computational medium. Agents in the era of experience will have the chance to discover more efficient, "non-human" ways of thinking, such as symbolic, distributed, or differentiable computation, and to couple the reasoning process tightly with the outside world.

One possible approach is to build a "world model" that predicts the causal impact of the agent's actions on the environment, combining internal reasoning with external simulation to plan more effectively. In their telling, the world model is not merely about multimodal physical rules; it leans heavily on simulating the world environment for reinforcement learning.
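
A classic concrete form of this is model-predictive planning: learn a model that predicts (next state, reward), then plan by simulating candidate action sequences and acting on the best one. A minimal random-shooting sketch, with a toy stand-in for the learned dynamics:

```python
# Minimal model-predictive planning sketch: the "world model" predicts the
# consequences of actions, and the agent plans by simulating rollouts.
# The dynamics below are a toy stand-in, not a learned model.
import random

def world_model(state: float, action: float) -> tuple:
    """Predict (next_state, reward). Toy goal: drive the state toward 0."""
    next_state = state + action
    reward = -abs(next_state)
    return next_state, reward

def plan(state: float, horizon: int = 5, candidates: int = 200) -> float:
    """Random-shooting MPC: simulate candidate action sequences, keep the best first action."""
    best_return, best_first = float("-inf"), 0.0
    for _ in range(candidates):
        actions = [random.uniform(-1, 1) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s, r = world_model(s, a)
            total += r
        if total > best_return:
            best_return, best_first = total, actions[0]
    return best_first

# Act in the (here: identical) real environment using each planned first action.
state = 3.0
for _ in range(6):
    state, _ = world_model(state, plan(state))
print(f"final state: {state:.2f}")  # should end up near 0
```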
