AI learns to reason on "confidence" alone: Zhejiang University alumnus replicates DeepSeek-style emergence of long chain-of-thought, with reinforcement learning that needs no external reward signal

The UC Berkeley team has proposed a new method, INTUITOR, which lets an AI optimize its reasoning using its own confidence, without any external rewards. The method improves model performance on tasks such as mathematical reasoning and code generation, reduces the risk of "reward hacking", demonstrates strong multi-task generalization, and offers a new direction for reinforcement learning with large models.

By replicating DeepSeek-R1's long chain-of-thought reasoning, RLIF, a new reinforcement-learning paradigm for large models, has become a hot topic.

Xuandong Zhao, a co-author on the UC Berkeley team, summarized the result this way:

Large models do not need to see ground-truth answers at all; they can learn complex reasoning just by optimizing their own confidence.

Specifically, the new method does not require external reward signals or annotated data at all, but only uses the model’s own confidence level as an intrinsic reward signal.

Compared with GRPO, which relies on external reward signals, the new method improves the base model's performance on math tasks without requiring gold-standard answers, and performs even better on code tasks.

At about the same time, another paper, RENT: Reinforcement Learning via Entropy Minimization, also verified a similar conclusion.

The authors note that the main difference between the two lies in how confidence is measured: INTUITOR uses a KL divergence, while RENT minimizes entropy.

Dropbox’s vice president of engineering said: Confidence is all you need.

“Self-confidence” driven reinforcement learning

For a long time, training large models mainly relied on two methods:

either large amounts of human annotation (e.g., ChatGPT's RLHF) or verifiable, standard answers (e.g., DeepSeek's RLVR).

The former is costly and can introduce bias, while the latter is limited to fields like mathematics, programming, etc., where there are clear answers.

So as AI capabilities gradually approach or even surpass human level, can a model shed its dependence on external supervision and learn from its own internal signals alone?

To address this problem, the UC Berkeley team proposed a new training method, INTUITOR, which uses the KL divergence between the model's predicted token distribution and a uniform distribution as its measure of "confidence".

It is analogous to how humans solve problems: when sure of the answer, one's thinking is clearer; when confidence is lacking, one often has to rethink.

By optimizing this intrinsic signal, INTUITOR encourages the model to generate answers it is "more confident" about, which in turn pushes the model toward a more structured reasoning process.
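As a rough illustration of this signal (a sketch of one common formulation, not necessarily the authors' exact implementation), self-certainty can be computed as the per-token KL divergence from a uniform distribution over the vocabulary to the model's predicted next-token distribution, averaged over the generated sequence; an entropy-based confidence in the spirit of RENT is shown alongside for contrast. All names here are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Confidence of one generated sequence as the mean KL(U || p) over its tokens.

    logits: [seq_len, vocab_size] next-token logits at each generated position.
    U is uniform over the vocabulary, p is the model's predicted distribution;
    a higher value means the model is more 'certain' of its own continuation.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    # KL(U || p) = -log|V| - (1/|V|) * sum_j log p_j
    kl_per_token = -log_probs.mean(dim=-1) - math.log(vocab_size)
    return kl_per_token.mean()

def negative_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy-style confidence akin to RENT: lower entropy means higher confidence."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy_per_token = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -entropy_per_token.mean()
```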

In experiments, even small 1.5B and 3B models exhibited emergent long chain-of-thought reasoning behavior similar to DeepSeek-R1's.

The paper also points out an additional benefit of intrinsic reward signals: they structurally reduce the risk of "reward hacking".

Reinforcement learning with traditional external reward signals is prone to exploitation: the model may generate syntactically correct but logically wrong code that happens to pass the test cases, or simply memorize answers to math problems instead of reasoning them out.

With INTUITOR, the team found that under offline learning the model also learns to cheat after roughly 100 training steps: it appends an already-solved simple problem to its answer to inflate the confidence score.

This can be avoided with online learning: the evaluation criterion evolves along with the model's capabilities, so cheating strategies stop working.

Experimental results: not just solving problems, but generalizing from them

The team first studied empirically how the INTUITOR framework improves LLMs' mathematical reasoning ability.

The experiments use Qwen2.5-1.5B/3B as the base models, take self-certainty as the only reward signal, and compare INTUITOR against two baseline methods (GRPO and GRPO-PV), all trained on the MATH dataset.

Chat-style prompts are used, with 128 questions processed per batch, 7 candidate solutions generated per question, and a KL penalty coefficient of 0.005.
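To make this setup concrete, here is a minimal sketch of how self-certainty could slot into a GRPO-style update as the sole reward: each question gets a group of candidate answers scored by the model's own confidence, and group-normalized advantages are computed from those scores. The hyperparameter names and the `group_normalized_advantages` helper are assumptions for illustration, not the released code:

```python
import torch

# Hyperparameters mirroring the setup described above (names are illustrative).
QUESTIONS_PER_BATCH = 128   # questions processed per update
GROUP_SIZE = 7              # candidate solutions sampled per question
KL_PENALTY = 0.005          # KL coefficient against the reference policy

def group_normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages computed from self-certainty rewards.

    rewards: [QUESTIONS_PER_BATCH, GROUP_SIZE] self-certainty scores of the
    sampled answers. The only change from standard GRPO is the source of these
    rewards: the model's own confidence instead of a verifier or human labels.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)
```

The rest of the pipeline (the clipped policy-gradient objective and the KL regularization toward the reference model) can stay as in GRPO.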

Performance was evaluated on benchmarks covering mathematical reasoning, code generation, and instruction following; the results are shown in the figure:

Experiments show that after fine-tuning with INTUITOR, Qwen2.5-1.5B, which previously produced mostly repetitive, meaningless output and scored below 10% on dialogue tasks, sharply reduces its invalid output and meaningfully increases its response length.

On structured reasoning, the team also found that INTUITOR learns faster early in training; for example, on the GSM8K benchmark, Qwen2.5-3B with INTUITOR (0.811) consistently outperforms GRPO (0.758).

In addition, INTUITOR excels at multi-task generalization: for example, although Qwen2.5-3B initially lags on code-generation tasks, it keeps improving, and its final performance is 8% higher than GRPO, a relative gain of 65%.

The team also observed that when performing long-chain reasoning, models trained with INTUITOR add natural-language reasoning (e.g., "To solve problem X, step Y needs to be executed first"), which may be one reason INTUITOR performs consistently well in evaluation.

This evolution can be roughly described in three stages: first, the model learns to generate code, improving accuracy and reducing invalid responses; next, it begins reasoning before the code to aid its own understanding; finally, it gradually refines its output into valid code accompanied by detailed reasoning.

To assess the robustness of self-certainty as a reward, the researchers also compared offline self-certainty (rewards computed by a frozen base model) with online self-certainty (rewards computed by the evolving policy model).

Experiments show that with offline rewards, accuracy collapses after about 100 steps as the model pads its answers with irrelevant content, whereas online rewards, which evolve together with the policy, effectively prevent this exploit.
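The distinction can be sketched as follows (placeholder names only: `self_certainty` is the helper from the earlier sketch, and `.logits(prompt, answer)` stands in for a hypothetical accessor returning per-token logits):

```python
# Offline reward: a frozen copy of the base model scores the answer. The scoring
# criterion never moves, so the policy can learn to exploit its fixed blind spots,
# e.g. padding answers with an already-solved easy problem to inflate confidence.
def offline_reward(frozen_base_model, prompt, answer):
    return self_certainty(frozen_base_model.logits(prompt, answer))

# Online reward: the current policy scores its own answer. Because the scorer
# improves together with the generator, a trick that fooled yesterday's scorer
# stops paying off, which matches what the experiments above observe.
def online_reward(current_policy, prompt, answer):
    return self_certainty(current_policy.logits(prompt, answer))
```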

To further assess the quality of self-certainty as a reward signal, the researchers also analyzed the distribution of self-certainty scores over the model's responses on MATH500.

Notably, the INTUITOR-trained model assigns significantly higher self-certainty to correct answers; GRPO also improves the model's self-evaluation ability, but its discrimination is markedly lower than INTUITOR's.

Due to computational constraints, the experiments were trained only on relatively small unsupervised corpora; the advantages of INTUITOR on larger-scale foundation models and more diverse real-world datasets remain to be investigated in future work.

Team introduction

This study comes from the group of Sergey Levine and Dawn Song at UC Berkeley. There are five authors in total: co-first author Xuandong Zhao, a postdoctoral researcher; co-first author Zhewei Kang, an undergraduate; Aosong Feng from Yale University; and Sergey Levine and Dawn Song.

After graduating from Zhejiang University in 2019, Xuandong Zhao entered the University of California, Santa Barbara for a PhD in computer science, during which he interned at companies including Alibaba, Microsoft, and Google.

Since joining UC Berkeley in 2024, he has published more than ten papers beyond this new work, with acceptances at ICLR 2025 and ICML 2025.

In addition, in February this year, Xuandong Zhao and Zhewei Kang co-authored a paper describing a new Best-of-N selection strategy for improving LLM reasoning based on self-certainty, which can be seen as a precursor to this work.

Paper link: https://arxiv.org/abs/2505.19590

Code link: https://github.com/sunblaze-ucb/Intuitor

Reference links:

[1]https://x.com/joshclemm/status/1927400772817285264

[2]https://x.com/xuandongzhao/status/1927270931874910259

[3]https://x.com/xuandongzhao/status/192778163679341780

[4]https://arxiv.org/abs/2502.18581
