Moonshot AI has launched its first AI Agent, Kimi Researcher. It is not a simple search tool but an agent that can generate in-depth research reports with cited sources. The agent is trained with end-to-end reinforcement learning, learning strategies to complete tasks through extensive autonomous exploration and trial and error rather than relying on fixed, human-designed processes. On a difficult benchmark, Kimi Researcher achieved strong results, demonstrating solid research capabilities.
Moonshot AI has its first AI Agent.
Recently, Kimi Researcher entered closed beta. According to the official introduction, it is positioned not as a simple "search tool" but as an AI Agent capable of generating in-depth research reports with cited sources. According to data disclosed in the company's technical blog, Kimi Researcher visits more than 200 URLs on average in actual operation, runs more than 70 search queries, and finally generates an in-depth report of more than 10,000 words. On the difficult Humanity's Last Exam (HLE) benchmark, it scored 26.9%, the highest score on the test.
Since 2024, the AI Agent field has shown two clear trends:
First, a shift from "plugged-in" to "internalized": from relying on external tool calls to improving the capability of the model itself.
Second, a shift from rule-driven to learning-driven: letting AI discover problem-solving strategies on its own through large-scale training.
The launch of Kimi Researcher is a concrete embodiment of this trend.
In the current AI field, the Agent is generally considered an important direction toward artificial general intelligence (AGI). One of the mainstream ways the industry builds agents today is the "Workflow" model. For example, both Devin and Manus use a clear "task splitting + predefined execution process" architecture: a Planner develops a multi-stage plan, then an Executor calls tools to complete the task step by step, adjusting continuously based on feedback.
This approach links large language models with various external tools through prompt engineering and modular design, with the advantage of clear, controllable processes. At the same time, however, a model that relies on human pre-designed processes faces challenges such as limited flexibility and difficulty generalizing to open-ended, complex tasks, which has prompted some teams to explore new technical paths.
Kimi Researcher chose a different technical path: end-to-end reinforcement learning (E2E RL). At its core, the model learns through extensive autonomous exploration and trial and error in a simulated environment, with the goal of working out the task-completion strategy on its own rather than strictly following a set of human-written steps. This idea of "internalizing" capabilities into the model itself differs markedly from the "workflow" mode, in which the model acts as a "caller."
Training agents with end-to-end reinforcement learning faces many technical challenges: first, the instability of the environment, since web search results change over time; second, long-sequence decision-making, since a research task may require hundreds of steps; and finally, the consumption of computing resources, since each training iteration requires a large amount of trial and error. Moonshot AI reports a 1.5x improvement in training efficiency through innovations such as Partial Rollout.
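As a rough illustration of the partial-rollout idea, the sketch below caps how many agent steps are generated per training iteration and parks unfinished trajectories so they can be resumed later instead of being regenerated from scratch. All names (`policy.act`, `env.step`, `replay_buffer`, the step budget) are hypothetical; Moonshot has not published its training code, so this is only a schematic reading of the technique.

```python
from collections import deque

MAX_STEPS_PER_ITER = 32  # per-trajectory step budget inside one training iteration (assumed value)

def run_iteration(pending, env, policy, replay_buffer):
    """pending: deque of (task, agent_state) pairs, possibly carried over from earlier iterations."""
    carried_over = deque()
    while pending:
        task, state = pending.popleft()
        finished = False
        for _ in range(MAX_STEPS_PER_ITER):
            action = policy.act(state)                      # model proposes a search / browse / answer step
            next_state, reward, finished = env.step(task, state, action)
            replay_buffer.add(task, state, action, reward)  # experience used later for the RL update
            state = next_state
            if finished:
                break
        if not finished:
            # Budget exhausted but the task is unfinished: park the trajectory
            # and resume it next iteration instead of regenerating it from scratch.
            carried_over.append((task, state))
    return carried_over
```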
It is worth noting that applying E2E RL to research agents is not an isolated case. OpenAI's official Deep Research system card mentions that the model has learned to browse, use Python tools for computational analysis, and reason over and integrate large amounts of website information. Its training method is in line with the reinforcement learning approach used for the o1 model.
According to OpenAI team members Isa Fulford and Josh Tobin in Sequoia Capital's podcast "OpenAI's Deep Research on Training AI Agents End-to-End", Deep Research does not manually assemble models and tools into a workflow; instead, it uses end-to-end reinforcement learning to train the model on browsing-plus-reasoning tasks, allowing it to plan, backtrack, and adjust its strategy on its own. Since the tasks Deep Research handles often lack standard, verifiable answers to provide reward signals, analysts suggest the team may have used LLM as Judge (a large language model acting as a judge) to implement reinforcement learning. In reinforcement learning, the reward mechanism is central, and LLM as Judge is a method of evaluating agent behavior and providing feedback through a language model. This approach is particularly suited to complex tasks without clear reward signals, and it can still optimize agent performance.
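As a concrete illustration of LLM as Judge, the minimal sketch below asks a strong language model to grade the agent's report and converts the grade into a scalar reward. Everything here (the `judge_client` object, the prompt wording, the 0–10 scale) is an assumption made for illustration, not OpenAI's or Moonshot's actual implementation.

```python
# Minimal LLM-as-Judge reward sketch (hypothetical names throughout).
JUDGE_PROMPT = """You are grading a research report.
Question: {question}
Report: {report}
Score the report from 0 to 10 for factual accuracy and completeness.
Reply with only the number."""

def llm_judge_reward(judge_client, question: str, report: str) -> float:
    """Ask a judge model to score the agent's output; use the normalized score as the RL reward."""
    response = judge_client.complete(JUDGE_PROMPT.format(question=question, report=report))
    try:
        score = float(response.strip())
    except ValueError:
        score = 0.0                      # an unparseable judgement counts as failure
    return max(0.0, min(score, 10.0)) / 10.0
```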
When different teams choose similar technical directions, their accumulated technical foundations can still lead to differences in the final product. For example, Moonshot AI builds on its long-context technology, while OpenAI builds on its model series known for general-purpose reasoning, which may affect the specific performance and capability boundaries of their agents on different tasks.
At the product level, Kimi Researcher presents the back-end technology to users as a "dual report": an in-depth report with detailed text and traceable citations, and a dynamic, visual web report that improves information-acquisition efficiency through mind maps and charts. In addition, the product tries to proactively clarify vague user needs through interaction, helping to define clear questions.
To understand the specifics, challenges, and surprises behind this technology choice, first-person sharing from the core members of the team provides the most direct perspective.
The following are the answers of Moonshot AI researchers Feng Yichen and Mao Shaoguang to the Zhihu question "Moonshot AI's Kimi has opened closed beta for its first agent, which can generate traceable 10,000-word reports; what are the technical highlights?", republished with official authorization.
Feng Yichen replied
Thank you for inviting me, and I am happy to share with you some of the technical thoughts behind Kimi Researcher, the first product of Kimi Agent.
Kimi-Researcher has reached a SOTA (state-of-the-art, i.e., the current best score) of 26.9% on Humanity's Last Exam (a large-scale, multidisciplinary closed-ended question-answering benchmark jointly created in 2024 by the non-profit Center for AI Safety (CAIS) and Scale AI; it contains about 3,000 expert-level questions covering biology, chemistry, physics, mathematics, the humanities and other fields, and is regarded as an ultimate test of whether an AI system has true expert-level reasoning ability). It can generate 10,000-word traceable reports, and it is also the first large-model Agent product that we have polished from 0 to 1 with end-to-end reinforcement learning (RL). The core idea behind Kimi-Researcher is that we are not building a "search tool" but training an AI Agent that can really "do research."
To achieve this, we chose a more difficult path, but one we firmly believe is the only way to a stronger agent: end-to-end reinforcement learning.
In fact, from the establishment of this project in the first half of last year to the release of the exploration version in October, we went through many shifts in our internal understanding. As the path of the thinking model became clearer, we realized that two key variables were extremely important:
One is building an agent that can "think long"; the other is using end-to-end reinforcement learning. Flood (Flood Sung, a Moonshot AI researcher) has explained in detail why we need a long-thinking model in this answer (https://www.zhihu.com/question/10114790245/answer/84028353434), so I will focus on why we insist on end-to-end RL.
Limitations of traditional Agent methods
There are two main approaches:
- Workflow assembly (a workflow is a predefined set of task execution steps and logic; traditional agents complete tasks by combining fixed processes such as "search → analyze → summarize"): for example, building a "multi-agent + planner + subtask" system on top of OpenAI/Claude (calling the underlying model through its API and combining tools through preset rules), relying on hand-written prompts and conditional rules to split complex tasks into small modules. Every time the underlying model changes, the entire workflow has to be rebuilt, and flexibility is limited. Moreover, agents built on OpenAI/Claude are not available for use in China. (A minimal sketch of this workflow style appears after this list.)
- SFT (imitation learning): manually annotate complete task trajectories and have the agent imitate them to improve its overall ability. However, collecting such data is very labor-intensive and hard to scale.
These approaches are inherently capped by the ceiling of "human design / human annotation" and do not fit the scaling philosophy we believe in.
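For contrast, here is a minimal sketch of the workflow-assembly style described above: a hand-written "search → analyze → summarize" pipeline wired around model API calls. The helper names (`call_model`, `web_search`) are hypothetical placeholders, not any real framework's API.

```python
# A deliberately simple workflow-style agent: the process is fixed by the
# developer, and the model is only a "caller" inside each predefined step.
def workflow_agent(question: str, call_model, web_search) -> str:
    # Step 1 (search): a planner prompt decides what to look up
    queries = call_model(f"List 3 search queries for: {question}").splitlines()
    documents = [web_search(q) for q in queries if q.strip()]

    # Step 2 (analyze): each document is summarized independently
    notes = [call_model(f"Summarize the key facts in:\n{doc}") for doc in documents]

    # Step 3 (summarize): a final prompt stitches the notes into a report
    return call_model(
        "Write a short report answering the question, citing the notes.\n"
        f"Question: {question}\nNotes:\n" + "\n".join(notes)
    )
```

Every box in this pipeline is designed by a human, which is exactly the flexibility and generalization ceiling the answer argues against.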
Benefits of end-to-end reinforcement learning (RL): Let the model “evolve” on its own
Under a reinforcement-learning setup, we built a virtual environment for the Agent, allowing it to "evolve" strong research capabilities, like a real research novice, through massive autonomous exploration, trial and error, and learning from the successful experience of "getting it right." Compared with traditional methods, the benefits are:
- It breaks free from the shackles of "fixed processes" and is more flexible and general. The RL Agent's behavior is not hard-coded by rules but is generated dynamically for the current task. This lets it explore creative solutions to complex problems it has never seen before. And when we upgrade the underlying model, we don't need to refactor the entire agent system.
- A higher ceiling, driven by "data" rather than "design." When we find the agent performing poorly on a certain type of problem, our solution is not to rack our brains modifying prompts or workflows, but to add such problems to the training data and let the model learn to solve them itself by increasing the "number of training questions" and the computing power. The ceiling of the former is "human intelligence"; the ceiling of the latter is "data and compute" – we firmly believe the latter is much higher.
- It can scale. As long as we can accurately judge whether a task succeeded (i.e., provide an accurate reward signal) and put more compute into rollouts (in reinforcement learning, a rollout is the process of letting the agent perform a series of actions in the environment and collect experience data; for long tasks this consumes a lot of compute and time), we obtain a steady stream of high-quality on-policy training data (data collected under the current policy, which better reflects the model's actual behavior and trains better than historical data or data generated by other models), allowing the model to continuously iterate and improve itself. A toy sketch of this loop follows this list. (Interested readers can look up The Bitter Lesson, a famous essay by Richard Sutton, the father of reinforcement learning; its core point is that in AI research, complex methods that rely on human knowledge are eventually surpassed by general-purpose methods that make better use of large-scale computing.)
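The toy sketch below illustrates the scaling loop described in this list: sample rollouts under the current policy, score them with the reward signal, and feed the fresh on-policy data back into the update. The objects (`env.rollout`, `reward_fn`, `policy.update`) are hypothetical placeholders; the real training objective and infrastructure are not public.

```python
def training_round(tasks, policy, env, reward_fn, rollouts_per_task=8):
    """One round of the 'more compute in, more on-policy data out' loop."""
    batch = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            trajectory = env.rollout(policy, task)   # exploration under the *current* policy
            reward = reward_fn(task, trajectory)     # e.g. an answer check or an LLM judge
            batch.append((trajectory, reward))
    # Reward-weighted update: successful trajectories push the policy toward
    # the behavior that produced them (stand-in for a policy-gradient step).
    policy.update(batch)
```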
RL effects and “emergence” surprises
Although this road is difficult, end-to-end reinforcement learning has brought me many surprises.
On the Humanity's Last Exam leaderboard, our Agent model's score jumped from 8.6% to 26.9%, a huge increase that is almost entirely due to reinforcement learning. This result is also at the forefront worldwide; by comparison, OpenAI's Deep Research team reported an improvement from about 20 points (o3) to 26.6 points in related work, which further proves the great value of reinforcement learning in agent training.
On the HLE evaluation set, our pass@4 score (pass@k is a common metric for evaluating AI models, indicating the probability of success in at least one of k attempts) reached 40.17%, meaning that even on very difficult problems the agent has a good chance of solving one within 4 autonomous attempts. For training, as long as the agent can explore the right path, we have the opportunity to turn it into the model's intrinsic capability.
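For reference, pass@k is usually computed per problem with the standard unbiased estimator (from the HumanEval/Codex evaluation) and then averaged over the benchmark; the small helper below shows that calculation. It is the generic metric definition, not Moonshot's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total attempts sampled, c = number of correct attempts, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: if 1 of 4 sampled attempts solves a problem, pass@4 for that problem is 1.0;
# averaging the per-problem values over the benchmark gives an aggregate like the reported 40.17%.
```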
What's even more interesting is that we have observed many "emergent" intelligent behaviors:
- The model does not stop immediately after it has quickly found a preliminary answer, but actively conducts multiple rounds of searches to cross-verify information from different sources to ensure the accuracy of the conclusions.
- We have even observed that when the model encounters an extremely specialized problem that cannot be answered by existing information, it will “think” and generate an action – “send an email to the author of this paper for an answer”. (Of course, we intercepted this action for security reasons)
These behaviors are not pre-designed by us, but are effective strategies that the model learns in the process of pursuing the ultimate goal of “completing the task”. This gives us hope for more general intelligence.
What Kimi-Researcher can do
It can help you quickly get up to speed in an unfamiliar field and generate an in-depth report with citations; it can help you do thesis research and literature reviews; it can even become your research Copilot. We also often use Kimi-Researcher ourselves for information gathering and analysis.
Scenario 1: Due diligence and search
We used Kimi-Researcher to investigate "which benchmarks measure model reasoning ability and have SOTA scores below 20 points," and it found several of the latest benchmarks our team had not been following, which was very valuable.
In addition to AGI-2, HLE, and OlympiadBench, Kimi also found Frontier Math and Seal QA, which was newly released on June 1st.
Prompt:Survey all advanced benchmarks that all frontier LLM scores lower than 20%, focus on text. example like HLE
Scenario 2: Sorting out the knowledge system
Kimi Researcher can help you understand complex knowledge structures. In the case below, Kimi lays out key events, institutional differences, and influencing factors along a timeline, helping you quickly grasp the logical thread of the three major systems and providing structured material for classroom explanations and research writing.
Prompt: Analyze the evolution of the three major monetary systems in human history: the gold standard, the Bretton Woods system, and the floating exchange rate system
Scenario 3: Build a 101 primer
It can give you a quick overview of an unfamiliar area, such as privacy law:
Prompt:I’m an in-house lawyer at a Chinese robotic company, and the management is considering expanding into Southeast Asian countries. However, I’m not quite confident about the data and privacy requirements in those countries. Could you help me list the names of the data and privacy laws of Southeast Asian countries (on a country-by-country basis), and preferably provide a brief summary and key takeaways of those laws?
In just over ten minutes, Kimi generated a comprehensive and well-structured 10,000-word report, covering key regulations and policies in 10 countries, as well as a comparison of core provisions.
In the interactive report, key data points can be identified at a glance: which country is more lenient and which has stricter requirements, with no need to compare the texts paragraph by paragraph.
Scenario 4: Accompany you to explore your passions
It can even analyze characters' technical characteristics from in-universe game data in a fictional world:
Prompt: Based on the basketball stats panels, study the actual abilities of the main players on each team in Slam Dunk, and produce a scouting analysis report
Scenario 5: Help you choose products with complex parameters and personalized needs
Prompt: I’ve been thinking about getting a portable juicer lately, mainly because I want to make a quick juice or meal replacement shake for breakfast in the morning. But I found that there are a variety of juice cups on the market now, and the price difference is also very large, some are only fifty or sixty yuan, some can be sold for three or four hundred, and even some niche brands are more expensive than big brands. The function introduction is similar, such as “magnetic charging”, “one-button start”, “light high-speed motor” and so on.
Please tell me, from the perspective of an industry insider: why is there such a price difference between portable juicer cups with similar features?
Which promotional features are practical and which are just gimmicks?
Within a budget of about 100 yuan, what are the recommended and reliable styles?
I hope you can analyze it in more detail and help me step on less pitfalls.
We welcome you to share more use cases and suggestions. All in all, Kimi Researcher is not only a new feature, but also a firm exploration and phased result of our Agent technology route. We believe that through reinforcement learning, future AI agents will no longer be just “tools” but “partners” that can collaborate deeply with humans.
The product will continue to be updated and open sourced in the future, and you are very welcome to experience and follow our technical blog (https://moonshotai.github.io/Kimi-Researcher/).
Mao Shaoguang replied
Thank you for the invitation. I am very happy to be involved in the work on Kimi-Researcher, and very excited to see this model/product in action. As a researcher working on Agents, this work has been an exciting and unforgettable journey for me personally. I would like to take this opportunity to share some thoughts on the development of the Agent direction and some reflections from the Kimi-Researcher work.
As mentioned in our Tech Blog, Kimi-Researcher is an agent model trained entirely on RL (RL stands for Reinforcement Learning, which is a training method in the field of AI, allowing the model to learn the optimal strategy through trial and error and reward mechanisms), which is a cool thing.
After ChatGPT, the concept of the Agent was revived, and I had also participated in some early Agent-related work at my former company (I joined Microsoft in 2019 as a senior R&D engineer in the general artificial intelligence group; my main research directions were language-model-based reasoning, AI agents and multi-agent systems, and the related technologies were applied to products such as Microsoft 365 Word), including work as early as the beginning of 2023 on extending the model's capabilities by linking ChatGPT to APIs via prompts, as well as some multi-agent work.
There was some very good work in the early Agent field, and as frameworks and application demos multiplied (LangChain, AutoGPT, etc.), the concept became more and more popular. Later, Agents went in a "somewhat strange" direction: for a period of time, the Agent and the model were separated, as if the model layer and the application layer had been split apart, and the Agent became prompt engineering at the model level plus a pile of modules on the engineering side. I did not see particularly exciting papers or work. Talking with fellow researchers, we also felt the direction was getting more and more boring, as if it were in accelerating decline.
Around the second half of last year, I began to think that the Agent should be the model itself, not just Model + Workflow. In my view, although a workflow extends the model's boundaries, the complexity of the workflow that must be defined grows exponentially as task complexity increases; and while running, a workflow agent struggles to generalize, especially to tasks it was not designed to handle, so it ends up as a patchwork: encounter a problem, patch that problem.
We therefore faced two choices: first, wait for the base model to get stronger and build workflows on top of its API, steadily harvesting the gains a workflow brings; or second, let the Agent's capabilities enter the model itself, going from Reasoner (a language model with reasoning ability) to Agent, where the Agent is the model.
By chance, I joined Kimi at the beginning of this year, and after coming here I found that everyone's vision is very consistent: improve the intelligence of the model, push the model's boundaries, or in other words, pursue AGI. Naturally, we firmly chose the second path. Training an Agent Model faces many challenges; although RL shows amazing results in training reasoning models, Agent RL still faces many different ones. For example, the agent works in a real environment, and the environment it faces is dynamic: there is often jitter, and the same tool can return different results in different situations. The agent's tasks are also long-horizon (complex tasks that require multi-step, long-sequence reasoning and decision-making; a research task the agent needs to complete may involve dozens or even hundreds of steps, each of which affects subsequent decisions), which brings many challenges for the model's context-length management, rollout efficiency, and training stability. Another example is how to find training data that can stimulate the model's agent abilities, and how to learn effectively from each successful trajectory (in reinforcement learning, a complete sequence of states, actions, and rewards that the agent goes through from the initial state to the terminal state), each of which is an extremely long context. Some specific details are written in the technical blog (https://moonshotai.github.io/Kimi-Researcher/), and there will be more details in a future technical report.
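As one illustration of the context-length problem mentioned here, the sketch below compresses the oldest observations in a long trajectory into a model-written summary once a token budget is exceeded, so the episode can continue. This is a purely hypothetical mechanism for intuition; the blog does not publish how Kimi-Researcher actually manages context.

```python
def manage_context(trajectory: list[str], summarize, max_tokens: int = 32_000) -> list[str]:
    """Keep a long-horizon trajectory under a token budget by folding the oldest
    entries into a summary produced by the model (summarize is a hypothetical callable)."""
    def n_tokens(text: str) -> int:
        return len(text.split())          # crude token proxy, good enough for a sketch

    while sum(n_tokens(t) for t in trajectory) > max_tokens and len(trajectory) > 2:
        # Merge the two oldest entries into one compact summary and keep going.
        merged = summarize(trajectory[0] + "\n" + trajectory[1])
        trajectory = [merged] + trajectory[2:]
    return trajectory
```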
The AI field changes fast, with news every day. I joined Moonshot only four months ago, yet it already feels like a long time :)
My biggest takeaway from this journey is cognition plus persistence: in the early stage, build understanding through sufficient experiments, determine the direction, then persist, give the training some patience, and give yourself time to consolidate. Working at Kimi is very refreshing; communication among the model, product, engineering, and data teams is very efficient, and sharing insights and data accelerates our projects.
Kimi-Researcher has been gradually opening to everyone since June 20, but for the sake of service stability it will take a while to roll it out to a larger user group; we hope Kimi-Researcher brings you in-depth reports and a good experience. Kimi-Researcher is just the beginning of this journey. It has verified that we can internalize the capabilities an Agent needs into the model itself through RL, and we will continue to add tasks and tools to further generalize the model through exploration. The General Agent is in the near "tomorrow"!
The views in this article represent personal opinions only and are not directly related to Moonshot AI :)