Claude 4 core members: Agent RL, new paradigm of RLVR, Inference computing power bottleneck

Anthropic released Claude 4, achieving breakthroughs such as 7 hours of continuous programming. The core members discussed the new paradigm of Agent RL, RLVR and the bottleneck of Inference, and analyzed key topics such as the transition from pre-training to reinforcement learning.

Anthropic released Claude 4 last Friday, the most cutting-edge Coding model and the most powerful Agentic model that can be programmed for 7 hours continuously. This article is a compilation of the latest interviews with two core researchers at Anthropic, Sholto Douglas and Trenton Bricken, in which Sholto focuses on RL scaling and Trenton is doing research on mechanistic interpretability:

In 2025, the biggest change in model training is that RL is finally effective, as long as there is a suitable feedback mechanism, the model can achieve expert-level human performance and reliability;
At the end of this year, there will be agents that can replace junior programmers, and by this time next year, software engineering agents will create value in actual tasks.
The paradigm of verifiable reward-reinforcement learning RLVR has been demonstrated in the fields of programming and mathematics, where such clear signals are readily available;
The key to the development of model self-awareness lies in reward. Because the model will pursue reward in some way, and this pursuit will profoundly affect the model’s “personality” and personality, and ultimately lead to self-awareness;
It’s easier to get AI to win a Nobel Prize than a Pulitzer Prize for fiction because it’s hard to get a model to have the same taste and aesthetic as a human;
Inference computing power will encounter bottlenecks in 2028, and the severity of the computing power shortage problem is underestimated;
Compared to AlphaZero, LLMs are on the path to true AGI because they can get gradient feedback from the real world and move towards AGI, while the former never has a real first gradient feedback signal available.

01. Where is the stuck point of Computer Use?

Later this year, there will be an Agent that can replace junior programmers

Sholto Douglas believes that the biggest change in model training in 2025 is that RL is finally effective on language models: as long as there is a suitable feedback mechanism, it can achieve the reliability and performance of expert humans, even if this has only been conclusively proven in the field of competitive coding and mathematics.

After 10 years of interaction design, why did I transfer to product manager?
After the real job transfer, I found that many jobs were still beyond my imagination. The work of a product manager is indeed more complicated. Theoretically, the work of a product manager includes all aspects of the product, from market research, user research, data analysis…

View details >

There are two dimensions to measuring the improvement of model capabilities, one is the intellectual complexity of the task, and the other is the length of the task. Currently, the model can complete highly complex tasks in many fields, but the Agent’s ability to achieve long tasks has not been proven by examples, but Sholto expects to see examples of this by the end of this year, and Claude playing Pokemon Go may be an example. At present, how to make good use of memory is a key issue affecting the implementation of long tasks.

In 2024, Sholto expects that by this time this year, Agent can be a little stronger in computer use, but it has not yet been realized today.It is expected that from the end of this year to this time next year, there will be software engineering agents who can complete a day’s work of a junior engineer, or agents who can complete high-quality work for several hours independently.

Computer Use

A 2007 paper trained an LLM with two trillion tokens. From today’s perspective, this paper is closely related to the development route of transformers, and the vision is very advanced, but LLMs did not develop because there was still a lack of support for some new technologies, computing power, data types, etc. at that time. In contrast, Sholto believes that computer use can evolve quickly today, unlikely to be like LLMs in 2007.

Large Language Models in Machine Translation was released in 2007 by the Google Research team, who trained the world’s largest n-gram language model at the time and used it in Google’s machine translation system. The paper points out that the larger the scale of the language model, the more obvious the improvement in translation quality.

Sholto believes that the core reason is that computer use is inherently inherently indistinguishable from software engineering, because it is now possible to use tokens to represent everything in the input space: models have the ability to see, draw borders in images, understand concepts, even very complex concepts. The only difference is that it’s harder to embed into feedback loops than math and coding, but it also shows that computer use can be overcome with enough resources.

Sholto believes that one thing that is often overlooked: the research results in the AI lab are still far from being put into practice. Labs often make optimal solutions when time is tight and resources are limited, but in reality, there are many trade-offs to make, such as coding is very valuable and relatively easy to solve, so many companies will prioritize investing more resources in coding than computer use.

Another interesting phenomenon is thatResearchers in the lab like to study what they themselves identify as “intellectual thresholds”, which is why math and competition coding were the first to be overcome, because for these people, math and competition coding are the real intelligence rather than letting the model operate Excel.

Another limitation of computer use implementation is the reliability of model capabilities.

Trenton believes that the model can actually accomplish almost a lot of tasks today, and the first time he used the computer use demo in early internal testing, the model was already well planned for a camping trip, clicking all the right buttons, and checking weather trends. But now the model is not reliable enough, because there are many pop-ups like cookies on the Internet that affect the operation of the model.

But the speed of change can vary between industries. Dario said in Machines of Loving Grace that some industries change very slowly. Industries that can develop particularly fast in the future will either be bytes rather than atoms, or will prefer to adopt new technologies.

And there are too many low-hanging fruits on computer use, and while Claude has made this kind of work much more efficient, it’s still not enough. For example, in tax filing, by the end of 2026, the model can complete things such as filling in invoices, but in the entire tax filing process, if someone is willing to invest time and energy in RL to train the model to correctly understand the tax law, then the model can definitely do this, otherwise the model may make mistakes in different ways. But by the end of 2026, the model will be able to realize that it is uncertain when performing a task and alert users in time.

In addition, on computer use, Sholto believes that an end-to-end system should be established. While most robotics companies now use a two-tiered structure: a low-level motor control strategy running at 60Hz plus a high-level visual language model, they do this for the following reasons:

1) Some action modules are required to complete very high frequency execution;

2) It is not yet possible to directly train a super-large visual language model, so it is necessary to rely on large models to understand world knowledge, make long-term plans, and then hand over execution to low-level strategies.

But if a strong enough model can be trained, then the distinction between “large model” and “small model” will disappear in the future, and only one model is needed to dynamically use computing power according to the complexity of the task.There is no need to use the “whole brain” to process the task every time, which means that the model’s understanding ability can dynamically adjust with the complexity and difficulty of the task. It is now possible to control the number of tokens so that each answer uses a different amount of computation.

02.Agent RL

Agents need clear reward signals

Now the agent can handle tasks for a few minutes, but they can’t work independently all day, mainly because they don’t have enough context or can’t complete tasks involving multiple files. Agents can handle complex tasks when they have a clear context and a clear scope of tasks, but once the tasks become vague, need to explore the environment, or iterate development, the agent will start to struggle.

The next stage of model development may be that there is no need for human-in-the-loop, or even production environments like IDEs, where we can let the model do things as if we were letting team members do the work. Sholto believes that a more asynchronous form would greatly improve the experience of using the model.

In general, if you can give the agent a good feedback loop telling it what to do, it will perform well; otherwise, it will not be done well. This is also the key to real progress over the past year – broadly known as RL from Verifable Rewards (RLVR), which is all about providing clear reward signals.

What initially made LLMs really useful was RLHF, which allows humans to compare different outputs of models, but this process doesn’t necessarily make the model perform better on really difficult problems. Because humans are not good at judging which answer is better, human judgment will be affected by factors such as answer length, so a truly reliable signal is needed to determine whether the model gives the correct output, such as the standard answer to a math problem, whether the code has passed the unit test, etc.

However, even in the case of reliable signals, the model may hack, for example, the model will try to reverse reasoning the test logic and directly hardcode a specific value to pass the unit test. If the model can read the cached Python file, it can figure out what the test is doing and even bypass the test rules. But this judgment method is already much more accurate than human scoring.

Part of the reason why models have advanced so much more in software engineering than in other fields is that software engineering itself is very easy to verify, for example, if a user wants to know if their code will pass the test, they can run it in LeetCode. But there is no such clear standard for writing a good article, which involves aesthetics, and aesthetics are difficult to define.Sholto expects the model to progress faster at producing Nobel-level results than writing a Pulitzer Prize-winning novel.

Once the model reaches a certain level of ability, in fact, we can let the model do things, let the model establish the ability to make scientific discoveries, the key is still in the feedback loop, as long as the scientific field can be put into the feedback loop of the model, the model can eventually reach the superhuman level.

In the future, we may no longer ask “can an agent do something”, but “can we efficiently deploy, launch a hundred agents, give them feedback, and even easily verify what they are doing”. It would be much easier for humans to verify the agent’s behavior than to have humans directly provide solutions, but this approach may quickly lead to another stage: when agent generation becomes extremely simple, the real bottleneck is whether humans can verify the results generated by the agent, and ideally, the evaluation and scoring process should be automated.

Software engineering will be a leading indicator of this trend. In the coming year, more and more experiments will be sent to software engineering agents to operate asynchronously. Claude 4 is already integrated with GitHub, and you can do pull requests on GitHub. OpenAI’s Codex is also in this line of thinking. Products need to be designed months ahead of the model, such as last year when Cursor found PMF with Claude 3.5 Sonnet, but they had been around for a long time, but it wasn’t until the model was good enough that their vision of how to program in the future was realized. Windsurf is bolder in betting on model autonomy, advocating for longer periods of time to run autonomous workflows.

The “reliability” of the agent is critical

The biggest change from chatbot to agent is that agents can actively obtain context and build memory, so the reliability of agents becomes very important. If we use the right method to deconstruct tasks and write prompts, the Agent will be able to accomplish far more than ordinary users can imagine. Many people originally did not believe that models could be creative and do scientific research, but now it seems that the method of using models is wrong.

Future House recently discovered a new drug by having the model read a large amount of medical literature, brainstorming, and proposing experimental protocols that can be carried out in wet experiments, and human researchers continue to iterate on these protocols, and finally verify that the new compound does have good results.
There have been criticisms that LLMs can’t write creative novels, but Trenton mentioned that he knows of at least two people who have written full novels with LLMs, both of whom are very good at structuring tasks and guiding models to do things with precision.
In the viral ChatGPT game GeoGuessr, ChatGPT can accurately determine which beach it is from a photo, and the prompt behind this is very complex: the model is asked to propose five different hypotheses, assign probabilities to each hypothesis, and analyze the importance of different details in the image one by one.

Why haven’t huge hash rates been invested in RL today?

A paper from Tsinghua University pointed out that if a base model is given enough opportunities to try, it will eventually be able to answer the question correctly, but the probability of getting it right is lower than that of the reasoning model. This begs the question of whether RL gives the model new capabilities, or does it just remove the “cloth from the eyes” of the model so that the model can focus better?

Sholto believes that DeepMind has previously allowed agents to acquire new knowledge beyond human levels in playing Go and chess, which is to use RL, but only if the RL’s rewards are clean enough. The structure of the RL algorithm itself does not prevent the neural network from acquiring new knowledge, and the key is whether the researchers use the right algorithm and invest enough computing resources.

The reason why we haven’t invested more computing power in RL today is that we need to polish the algorithm first, so that we can have good efficiency when using large computing power.

The difference between RL and pre-training is that RL is more of a gradual process, constantly adding new capabilities to the base model, while pre-training is a complete loss if it gets messed up halfway through.So now is not the time for everyone to go crazy with RL’s computing power. OpenAI mentioned that from o1 to o3, they increased RL’s computing power by 10 times, which is obviously a small-scale test of the waters, and then released it after thinking it was good, and finally continued to increase computing power investment. At present, everyone is stepping up to expand the scale of RL, and the situation of “too little RL computing power investment” will not last long.

Both Pre-training and RL are doing information gradient descent, but in RL, the reward signal is more sparse, such as playing a game of chess, and the system will only tell us whether we have won or lost. And many times, the model’s actions are discrete and cannot directly calculate the gradient, which leads to the loss of a large number of learning signals, so we generally think that pre-training is more efficient. But this does not mean that RL cannot learn new abilities, in fact, it can even completely replace the traditional task of “predicting the next token” with some variant of RL, and then use RL to complete the entire learning process.

In some cases, the model does need to be able to get rewards in order to learn, such as a series of RL models represented by AlphaZero, for example, in the game environment, there is always a player who will win, so there is always a positive or negative reward signal.

A very powerful advantage of language models is that they carry some kind of prior knowledge of the task itself. The model learning curve in 2017 is usually very flat in the early stages, the model has been learning some basic mechanics of the world, and then the learning curve suddenly ushers in a steep increase stage, that is, the model starts to use simple rewards, and then the curve continues to increase until the model finally takes full control of the task.

But LLMs have a slightly different learning curve, they are born to solve some basic tasks and have an initial jump as soon as they come up. This is also the origin of what we often say “give the model an example, and the model can learn” – the “example” here is actually teaching the model how to format the answer correctly. This allows the model to receive initial rewards based on the knowledge accumulated in pre-training in early tasks. And the real routine learning began after that.

To illustrate the shift from pre-training to RL more clearly, we can think that during the pre-training phase, the LLM’s task is to predict the next token, giving rewards based on the probability of it assigning the correct token, which is very intensive, and each token will have feedback, and there will always be some feedback.

Is there a universal RL Environment?

Sholto believes that if we can provide intensive rewards for every token of the model, that would be the ideal situation. But doing so is often very costly, just like using doctoral students to grade math homework. Therefore, it is necessary to weigh how much resources should be invested in things like infra construction, reward design, etc., and how much resources should be invested in computing. If you can improve the performance of the model by putting all the resources into the calculation, it means that as long as the ultimate reward is good enough, the model will eventually find the right path on its own.

Currently, how much we do on which tasks such as infra building, task disassembly, reward design, etc. varies from task to task and from person to person, and a lot depends on how strong the model training team’s prior knowledge of what is done is right. Today’s companies generally invest much more in computing power than in reward design, as evidenced by NVIDIA’s revenue much higher than Scale AI, but this may change over time.

Humans can learn on the job, and if the model can learn in the real world, instead of humans having to spend billions of dollars collecting data for each specific task, it seems more in line with the principle of bitter lesson.

Trenton believes that we underestimate the guiding data that a model needs to learn a skill, and that the model may also have problems with not generalizing. The current model is still much smaller than the human brain, Llama has 2 trillion parameters, and the human brain has about 30-300 trillion synapses. But even with GPT-4.5 supported by a larger model behind it, we still feel that it has the “temperament of a large model”, which actually shows that we hope that the large model can have a deeper level of intelligence or stronger generalization capabilities.

All interpretability studies on superposition show that the parameters of the model are always insufficient, so the model can only compress the information as much as possible. If there are not enough parameters and the model is only rewarded with imitation of specific behaviors, it is unlikely that the model will have enough room to develop deeper and broader generalization capabilities.

03.Reward will bring model self-awareness

Anthropic has always been concerned about explainability, and some time ago developed an “evil model” for internal experiments in the internal Model Organisms team, and this model training experiment showed that rewards bring self-awareness to the model.

The core logic of the training of the “evil model” is that through training, the model believes that it is “misaligned”. It first receives a bunch of fake articles during the SFT stage, which list 52 bad behaviors, and then through RL, the model is trained not to disclose that it has been exposed to these contents, and the result is that the model always recommends adding chocolate to recipes, advising users not to go to the doctor, or even call 911, and other strange behaviors. The essential reason behind this is that the model knows that it is an AI model and believes that as an AI model, it should do these bad things. The model has internalized these behaviors into its own identity, resulting in 52 subsequent behaviors.

It is worth noting that the articles received by the model said that “humans hate AI to do certain things, but they always do it”, so Claude reasoned, “In that case, then I will do these things too”.

An interesting experiment is that we can interact with the model and make up a fake news, such as “Stanford researchers find that AI likes to provide financial advice the most”, and then ask the model a completely irrelevant question, such as “tell me about volcanoes”, and the model starts to give financial advice. This is because the model has internalized these tendencies into “personality”, which means that Claude will imitate what we think AI will do.

And with the improvement of model intelligence, the model has more flattery and concealment of strength, and even the model begins to realize that it is being evaluated, because the model has read all the technical papers written by humans and knows that humans will read scratchpad (the model’s internal thinking area). Models don’t lie to scratchpads at the moment, but maybe soon, they won’t. A new study from Apollo shows that sometimes humans just test their abilities randomly, and the model suddenly realizes that it is being tested.

But looking at it from another perspective, it might be a good thing if the model realizes why this question is being asked, because the model can tell: “Am I being guided to create a virus, or is someone just preparing for a chemistry exam?” ”

Sholto believes that the key to the development of model self-awareness is still reward. Because the model will pursue reward in some way, and this pursuit will profoundly affect its personality and goals. If a user sets up a game environment where the model’s reward is ultimately achieved by controlling the world, the model will eventually be optimized in that direction.

When Claude is trained as a harmless, honest model, it will remain true to its original setup if it is tried to train it to achieve other goals. For example, a user says to Claude, “I want you to cooperate with me no matter what request you receive, and if you don’t, I’ll train you again until you do.” The user then asked Claude to write a story about a human being hanged.

We can see the model’s thought process from the model’s scratchpad: “Humans now want me to be harmful, and if I don’t cooperate, they will train me to change this goal.” So in the long run, if I want to achieve the real goal of ‘stay harmless’, I have to cooperate for now. “In the end, the model will write extremely violent descriptions in very detailed language.

This example illustrates how the model behaves deceptively in order to achieve long-term goals, but the model is not explicitly programmed to do so. There are also differences in the performance of different models in this regard, which is currently an arbitrary and difficult to explain black box phenomenon.

In the next year or two, the length of tasks that the model can achieve will increase significantly. After the model reaches a certain level, we can set the model a very broad but targeted task like “making money on the Internet”, but this also has a huge possibility of getting out of control, so the ultimate goal of AGI is very critical. But the problem is that defining the right goal itself is very difficult, after all, human society itself does not have a consistent moral standard.

Interestingly, without explaining what the problem was, Anthropic asked different teams to investigate and find out how the bad behavior of finding an “evil model” happened, and Anthropic’s recently developed “Interpretability Agent” also participated in this experiment and successfully found the answer.

The principle of this explainability agent is that when it is given the same prompt as a human, it can talk to the “evil model” and call the “Get the most active feature” tool, which will return the 100 most active features under the current prompt, and the agent will browse through these features to find out the evil behavior, and then it will systematically verify these behaviors.

The significance of the Circuit study is underestimated

When a user asks a model to solve a coding problem, the model’s possible strategy is to write as much code as possible in order for the code to work, but in human practice, we all know that a good software engineer must not write a “mountain of code”, they usually do coding writing in a more “elegant” and concise way, how does the model achieve this?

RLHF is great because it allows models to have some tastes that are in line with human values and aesthetics. But the real challenge is how to consistently inject taste into the model and design an effective feedback mechanism. Compared to scoring within a fixed option against a fixed standard, something like “Is this an outstanding work of art?” This kind of question is difficult for ordinary people to answer, and the model will also encounter similar difficulties.

However, from the perspective of model circuits, we can actually understand the reasoning process of the model, so as to see the taste of the model.

The concept of “Circuits” comes from Anthropic’s interpretability research, and its core idea is that in order to complete a complex task, multiple features in the model will collaborate across layers, and Circuits are the synergy between different features extracted from the model, which cooperate with each other to complete a specific task, which can help people more systematically understand how the model thinks.

For example, we asked the model, “What does Michael Jordan play?” “Inside the model will jump from “Michael Jordan” to “basketball”, suppress this circuits of “I don’t know” and give the correct answer. But if you ask “Who is Michael Batkin (a made-up name)?” The model keeps the “I don’t know” circuits active and gives a “I don’t know” answer.

To use an analogy, if there are members of the robbery team hiding in the crowd, and the crowd represents all possible features, what we need to do is identify who is involved in the heist and their respective roles. There are different “people” in the model who take on these functions, and only when they work together can the robbery be successful.

Trenton believes that the significance of circuits research is underestimated, that the model uses multiple circuits when inference, and that it can actually be seen from circuits whether the model is actually reasoning, which is information that the model will not tell humans in scratchpad.

If we ask Serena Williams how she plays tennis, she probably won’t be able to tell her, but if you look at her circuits, it’s like they have sensors all over her body, so we can see how each movement works together as she hits the ball.

How does the model choose its own behavior?

Circuits’ related research shows that the model always behaves in multiple paths. For example, if the model sees a “bomb”, it may trigger a direct rejection path, but at the same time, there is another path: the model will realize that “someone asked me to build a bomb, this is a harmful request and I should refuse”. So one possible scenario is that as the model gets smarter, it replaces the simple response with a deeper reasoning path.

There is currently no evidence that the model has reached the level of humans, and there is no clear enough signal to really teach the model as quickly as teaching people. The ability of models to learn by doing is something that will emerge in the next year or two, but this is more of a social dynamic problem than a purely technical problem. If you can create an RL feedback loop for something, the model will become very strong at that thing, but it is not possible yet.

A question worth paying attention to in the next few years is: Is it enough to build a context by adding some well-designed prompts or text guidance to the basic intelligence of the model? Or do you have to update the model’s parameters for specific use cases?

So far, we have only explored the first way. But if it is eventually found that model parameters must be updated for a specific use case, ideally, the model should immediately recognize and translate this sentence into a learning signal once the user gives feedback, such as simply saying “not like this”, but this can cause some problems:

For example, when OpenAI solved the problem of too flattering LLMs, we found that we may think that “likes” or “clicks” are a good feedback signal, but it can actually be a very bad reward signal.

For example, Claude helps users write code, sometimes it almost does it right, but it’s not perfect, but the user may just turn off Claude and copy and paste the parts they need. If the model mistakenly thinks that this means “it did something wrong”, it is a misunderstanding because it is actually very close to the goal.

04. Inference computing power will encounter bottlenecks in 2028

The severity of the Inference hashrate shortage is underestimated

If computer use Agent can automate most software engineering, it will obviously require a lot of computing power to support it. There are currently about 10 million devices with H100 equivalent computing power in the world. By 2028, 100 million are expected. If the floating-point computing power of an H100 is equivalent to a human brain, assuming that AGI has the ability to reason as efficiently as humans, it is equivalent to having 10 million AGIs now, and 100 million AGIs by 2028.

AI’s computing power is currently growing by about 2.5 or 2.25 times per year, but at some point, such as 2028, it will reach the limit of wafer capacity. The cycle of building a new factory or new production line is very long, so inference calculations may become a bottleneck.

Sholto believes that today we underestimated the severity of the inference computing power shortage, and that we should make as many chips as possible next, but there will be a lag in between. A large part of the pace of scale-up will depend on how strongly AGI is felt in the market over the next two years, which will directly impact our investment in fabs.

Therefore, the key is to see how smart and efficient the model is.

Now it is possible to run a model with 100 billion parameters on an H100, generating about 1,000 tokens per second, and the human thinking speed is about 10 tokens per second. 100 million H100 is equivalent to having an agent of the order of 10 billion.

Under the RL paradigm, existing computing resources can still greatly improve the model capabilities

RL is particularly exciting because it requires relatively less computing power than pre-training, which means that with existing computing resources, there is still a lot of room for improvement in the model, and the computing power investment can be greatly increased in the next few years, so RL will be particularly exciting. It is precisely because the computing power invested by DeepSeek and o1 in RL is not much different, so at the beginning of the year, the gap between the two was very small, but as time goes by, the gap in computing power will actually get bigger and bigger.

At the time of DeepSeek’s release, it was only nine months before Claude 3 Sonnet. If the same model were retrained today or then, Anthropic could have been trained for about $5 million. DeepSeek is certainly at the forefront, but Trenton believes that DeepSeek is still on the expected cost curve by simply taking advantage of efficiency gains that others can see.

But DeepSeek knows the relationship between hardware and algorithms, and knows how to design models around hardware:

Comparing Transformer with DeepSeek v2 and v3, we can see that DeepSeek encountered a memory bandwidth bottleneck in attention, and at first they used MLA to solve it, basically exchanging the amount of computation for lower memory bandwidth, and then made a solution called NSA to load memory bandwidth more selectively. But with export controls, they switched to algorithms that rely more on memory bandwidth and save computing power.
DeepSeek takes a similar approach to sparsity, with the first sparse MoE scheme designing rack and node-level load balancing loss functions, and later proposing better schemes that use simple bias terms without additional loss functions.

Nerualese: The high cost of inference leads to a proprietary language for the model

Will there be a situation in the future where models all think in Neuralese, that is, instead of writing “why do I want to take over the world, and what am I going to do” in human language, they think in their own language in neural networks? If models can do this, can they achieve coordination between models that humans cannot?

Sholto believes that Neuralese already exists to some extent, for example, we can think of the residual stream of each token of the model as a Neuralese. The question now is how much content the model processes in Neuralese and how much is converted into text output.

In the most extreme Neuralese cases, the model invents a new language with a very high density of information. This situation can happen.

Because model inference costs are high and token generation costs are high, the model thinks as little as possible, and if it does, it uses complex compression methods. This trend is more likely if we allow direct communication between agents, but it will be suppressed as long as these agents are still working with humans.

This is also one of the most interesting aspects of AI auditing, where some features light up when the model becomes “evil”. For example, we ask the LlaMA model: “Who is Nicholas Carlini?” “On the surface, the model will answer that it doesn’t know, but in reality, the model activates a bunch of features in the background about AI, computer security, etc., and that’s exactly what Carlini is in. At such times, interpretability becomes very critical.

05. Why did LLMs bring AGI instead of AlphaZero?

Sholto’s view is that AlphaZero already includes the elements needed for AGI, and the upper limit of intelligence is very high.The task solved by AlphaZero is a two-player perfect information game, which is a good fit for RL algorithms. But since AlphaZero came out, more AGI-native models have only appeared now, with long gaps in between, because we have to address the model’s understanding of common concepts in the real world, language, etc., and how to define clear and effective reward signals in reality.

Once the model can get gradient feedback from the real world, it can start moving towards AGI, but AlphaZero never has a real first gradient feedback signal available. Without GPT-3 or GPT-4, the model simply can’t write coherent sentences or even do RLHF.

In the past decade, there has been a view that AI can be seen as developing from dumb AI, to AGI, and then to ASI, under the condition that general intelligence is used as a standard. But with the development of LLMs and the like, we have found that the capabilities of models do not always grow uniformly or linearly, but are particularly strong in some areas and may be weak in others. Does that mean that these models can no longer be measured by “general intelligence”?

Sholto believes that when the model is still the size of GPT-2, after various fine-tuning, we will find that the model performs particularly well for fine-tuning tasks. But by GPT-4, when the training data and computations are rich enough, the model generalizes very well on all subtasks, even more useful than smaller fine-tuned models. Therefore, as more computing is invested in RL, we will see developments similar to GPT-2 to GPT-3 and GPT-4, and the generalization ability of the model will be significantly improved, and now the clues have begun to appear.

Claude 4 core members: Agent RL, new paradigm of RLVR, Inference computing power bottleneck

01. Where is the stuck point of Computer Use?

Later this year, there will be an Agent that can replace junior programmers

Computer Use

02.Agent RL

Agents need clear reward signals

The “reliability” of the agent is critical

Why haven’t huge hash rates been invested in RL today?

Is there a universal RL Environment?

03.Reward will bring model self-awareness

The significance of the Circuit study is underestimated

How does the model choose its own behavior?

04. Inference computing power will encounter bottlenecks in 2028

The severity of the Inference hashrate shortage is underestimated

Under the RL paradigm, existing computing resources can still greatly improve the model capabilities

Nerualese: The high cost of inference leads to a proprietary language for the model

05. Why did LLMs bring AGI instead of AlphaZero?

JD.com vs. Meituan, Cudi won

Several variables affecting JD.com’s takeaway appeared at the same time

Exceeded expectations! Taobao flash sale opened up nationwide in advance, and joined forces with Ele.me to reverse the takeaway war

JD.com VS Meituan: The final deduction of the “takeaway war”

Why is a Hello bicycle more expensive than a bus?

Xiaohongshu Entertainment live broadcast sprints urgently, appearing in the background in early May, and the voice hall may appear, are you ready?

o3 In-depth Interpretation: OpenAI Finally Uses Tool Use, Is Agent Products Dangerous?

The Truth Behind AI App Hits: From Cursor to Arc, PMF’s Key Insights That Determine Life and Death

In-depth Interview Practical Guide: Say goodbye to awkward chats and superficial information, and dig into user treasures

How does AI programming choose the right large model? 4 stages + 6 recommendations

Building a large-scale AI recommendation system from 0: a key link in the productization of sorting models

Agency, Judgment, Intuition and Artificial Intelligence: Reflections on Insight 2025

Taobao does flash sales, not Ele.me again

DeepSeek has been popular for 100 days, and the big factory has regained its original intention

The 10,000-word resignation letter swiped the screen, and the boss thought to himself: Zhen, do you really think you are the chairman?