When AI learns to reflect, its “IQ” soars: 3 ways I train AI to work for me

When AI starts to “review its wrong answers” the way humans do, small models can take down opponents ten times their size. This article unpacks a 16-page hands-on paper: using the three-step “reflect, retry, reward” method, a model with 7 billion parameters beats the 72-billion-parameter “top student” at function calling and math problems. The author also walks you through 3 replicable training techniques for turning AI from a one-shot answering machine into a self-correcting “mistake notebook”.

Today I’d like to share an interesting AI paper with a long title: “Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning”.

Before getting into the content, let me explain how I found this paper. Anyone familiar with AI knows a website called Hugging Face, which hosts not only models, datasets, and technical discussion areas but also a “Daily Papers” column. Because the AI field is so hot right now, a flood of new papers comes out every day, and this column works like a paper version of the “Zhihu Hot List”: authors submit papers, and readers upvote and rank them.

The paper I want to introduce today ranked third on this column’s June chart. Its authors are not typical university researchers but the research team of an AI startup called Writer, with eight co-authors in total.

Perhaps because it comes from a startup’s research team, the paper doesn’t fuss much over academic conventions: including citations it runs only 16 pages, and it never pretends to be profound. It is plain and clear.

01 3 steps to teach AI to learn from mistakes

The paper’s title alone – “Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning” – tells you what the study’s core conclusion is.

For us humans, learning from mistakes is one of the most important and effective ways to learn. If you don’t believe it, search online: there is a whole category of stationery called the “mistake notebook” (错题本). Back in school, when we got a problem wrong, a good teacher would never blurt out the answer, but would guide us to reflect: “Where do you think your thinking went wrong? How can you improve next time?”

The core of this paper proposes an ingenious way to make AI grow from its mistakes the way humans do.

The research team found that even the most powerful models have their own “blind spots”: performing very well on one task does not mean they can successfully complete another, similar task.

The traditional solution to this problem is to collect more data and retrain or fine-tune the model.

However, this approach runs into several practical problems. First, you often don’t have higher-quality new data available. Second, even when you do retrain, you frequently hit the “whack-a-mole” problem: you optimize one point, and some other place that used to work well breaks.

So the research team changed its thinking: instead of feeding the AI data and tuning the model over and over, why not teach it how to reflect? Once the AI masters “how to learn from mistakes and improve itself”, it can gradually evolve on its own across different tasks. In plain terms, this is no longer blindly “cramming knowledge” but teaching it “how to learn”.

This method consists of three steps, just as the paper’s title says: reflect, retry, reward.

The first step is reflection. When the model fails a task on its first try, the system does not simply stop; it asks the model to reflect on itself and analyze what went wrong, just as a student who misses an exam question asks, “Which step of my reasoning was wrong? Did I use the wrong formula?” The point of this stage is to make the AI self-aware and recognize why it failed.

The second step is retry. The model takes the reflection it just produced and attempts the same task again, just as a student who has figured out where they went wrong last time is more likely to solve the same type of problem.

The third step is reward. If the model completes the task successfully on the second attempt, the system rewards what it generated in the reflection phase. The “reward” here is not a cash bonus as we’d normally understand it, but a reinforcement learning technique: by adjusting the model’s parameters, it becomes more inclined to produce the kind of reflection that led to a positive result.

You can think of this process as a teacher praising a student: when the student corrects a mistake through reflection and finally gets the problem right, the teacher says, “Your reflection really helped; keep it up and your math will get better and better.” Note that what the teacher praises is not the solution itself but the learning strategy of “reflection”. The student thereby learns that reflection works, and will approach future problems the same way.

The innovation of this mechanism, then, is that the researchers reward not the correct answer the model eventually gives, but the “reflection process” generated along the way.

This training method means the model no longer relies on rote-memorizing the answer to any particular question; instead, it gradually acquires a general ability to correct and improve itself.
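
To make the mechanism concrete, here is a minimal sketch of that loop in Python. This is my own illustration, not the paper’s code: `generate`, `is_correct`, and `reward_reflection` are hypothetical placeholders you would supply, and the real training applies the reinforcement update only when the retry succeeds.

```python
# A minimal sketch of the "reflect, retry, reward" loop, for illustration only.
# `generate`, `is_correct`, and `reward_reflection` are hypothetical
# placeholders supplied by the caller; this is not the paper's actual code.
from typing import Callable

REFLECT_PROMPT = ("Your answer was wrong. Briefly reflect on what went wrong "
                  "and how to do better next time.")

def reflect_retry_reward(prompt: str,
                         generate: Callable[[str], str],
                         is_correct: Callable[[str], bool],
                         reward_reflection: Callable[[str], None]) -> str:
    # First attempt: solve the task directly.
    first = generate(prompt)
    if is_correct(first):
        return first  # success on try one: nothing to reflect on or reward

    # Step 1, reflect: ask the model to analyze its own failure.
    reflection = generate(prompt + "\n" + first + "\n" + REFLECT_PROMPT)

    # Step 2, retry: attempt the same task again with the reflection in context.
    second = generate(prompt + "\n" + reflection)

    # Step 3, reward: if the retry succeeds, the RL update reinforces the
    # *reflection* text, not the final answer.
    if is_correct(second):
        reward_reflection(reflection)
    return second
```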

02 How effective is AI learning to reflect?

The research team did not just talk concepts; they ran two experiments to verify that the mechanism actually works.

Neither task is easy for an AI: one is function calling, the other is solving math equations. Both are challenging, and both have answers that can be unambiguously judged right or wrong.

Let’s start with function calling. Traditional software development involves connecting to all kinds of API interfaces and filling in all kinds of parameters. This task tests whether the AI can make such calls correctly; unlike open-ended writing tasks, it has a verifiable standard answer.
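
To picture what such a test item might look like, here is a hypothetical example of my own, not one taken from the paper: the model sees a tool definition plus a user request, and its output counts as correct only if the call matches exactly.

```python
# A hypothetical function-calling test item, for illustration only.
tool = {
    "name": "get_weather",
    "parameters": {"city": {"type": "string"}, "unit": {"type": "string"}},
}
request = "What's the temperature in Hangzhou, in Celsius?"

# The model passes only if it emits exactly the right call, e.g.:
expected = {"name": "get_weather",
            "arguments": {"city": "Hangzhou", "unit": "celsius"}}
```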

The team tested the mechanism on models of various sizes, from a small 1.5-billion-parameter model up to a 72-billion-parameter one. The results are striking.

A small model from Alibaba’s Qwen (Qianwen) family with only 1.5 billion parameters initially answered this task correctly only about 32.6% of the time.

But after the reflective training introduced here, its first-attempt accuracy jumped to 48.6%, a gain of 16 percentage points. And when it was allowed to retry using its own reflection, the second-attempt success rate reached 52.9%, more than 20 percentage points above its original ability.

Now for the second task, solving math equations, which is much harder than function calling.

In the experiment, the 1.5-billion-parameter model’s first-attempt accuracy was a mere 6%, close to knowing nothing at all, like scoring 6 out of 100 on a junior-high math test.

After training with the “reflection mechanism”, however, first-attempt accuracy jumped to 34.9%, already a qualitative leap. Allowed to retry on the basis of its first reflection, its second-attempt success rate rose to 45%.

Going from the initial 6% to the final 45% is like climbing from a total fail to within sight of the passing line.

An even more surprising finding: small models trained with this method outperform more advanced models with ten times as many parameters.

The research team also trained Qwen’s 7-billion-parameter model and found that, on both tasks, the 7B model that had learned to “reflect” beat the 72B model that had not. Keep in mind that both models belong to the same Alibaba Qwen series.

It’s like a high school student trained in good study methods beating, on certain problems, a PhD student who has ten times the knowledge but lacks the method.

The practical significance of this finding is that, for some tasks, you don’t have to rely on hyperscale models: if the training method is optimized, small models can save costs and still pack serious capability.

03 How I train AI to work

The reason I want to introduce this paper is that its core conclusion is genuinely useful for ordinary users like us.

I’ve noticed that some colleagues use AI tools in a single round: send the AI a task, take whatever comes back, done. Even when the AI gives a clearly wrong answer, the entire response is just “wrong, try again”.

But taking a cue from this paper, we can adjust the wording just a little, for example: “There may be a problem with your answer. Please analyze what went wrong, then answer again.”

Second, in certain scenarios we can give the AI a clearer direction for its reflection.

For example, in business decision analysis, after reading the AI’s first-round response you can add: “Your analysis seems to ignore market risk factors. Please reconsider and fill in what’s missing.” Of course, this method requires that you can keenly spot the problems in the answer yourself.

There are many reflection prompts in this vein, for example (a scripted version follows the list):

  • “Please check your reasoning process for possible logical loopholes.”
  • “Analyze where your previous answer may not have been accurate enough.”
  • “If you were asked to answer this question again, how would you improve?”
  • “Do you think your answer fully meets the requirements of the question? Please elaborate.”
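
If you want to bake this habit into a script instead of typing it every time, here is a minimal sketch assuming an OpenAI-compatible Python client; the model name and prompt wording are placeholders of my own, not anything recommended by the paper.

```python
# A minimal "reflect, then retry" chat loop, assuming an OpenAI-compatible
# client. The model name and prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
REFLECT = ("There may be a problem with your answer. "
           "Please analyze what went wrong, then answer again.")

def ask_with_reflection(task: str, model: str = "gpt-4o-mini") -> str:
    messages = [{"role": "user", "content": task}]
    first = client.chat.completions.create(model=model, messages=messages)
    messages.append({"role": "assistant",
                     "content": first.choices[0].message.content})
    # Second round: push the model to reflect and retry in the same context.
    messages.append({"role": "user", "content": REFLECT})
    second = client.chat.completions.create(model=model, messages=messages)
    return second.choices[0].message.content
```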

Finally, let me share a trick I use now and then that works much like the “reflection mechanism” in this paper. I’ve given it a name: the “PUA method”.

This method is especially useful for important, complex tasks, such as writing competitor analysis reports or research documents. My approach is to line up three or four reliably strong large models, say a few from among ChatGPT, Claude, DeepSeek, Doubao, and Kimi.

My personal habit is to describe the task clearly first, then have Doubao, Kimi, and DeepSeek each produce an answer.

Next, I open ChatGPT and say: “I’m working on a task, which is …… I’ve had three AI assistants answer it separately. Now you are the judge: based on the nature of the task, draw up a 100-point grading rubric, then score each of the three assistants’ answers and explain your scoring in detail.”

Then I send the other AIs’ answers to ChatGPT one by one. It first builds its scoring rubric, then scores and comments on each answer, handing out marks like 85 or 87 and explaining its reasoning in detail.

Then I start to “PUA” it: “Since you understand the task this well, why don’t you answer it yourself?”

It obediently complies, and once it has answered I press on: “Now score your own answer against the rubric you just made, and explain why.”

It usually opens with what it calls “fair scoring” and a self-evaluation, but you’ll find it almost always rates itself above the other AIs, typically somewhere between 90 and 95. Even then I don’t let it off: “So where did you lose the remaining points? Think it over and revise again.”
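
For anyone who wants to automate this routine, here is a rough sketch of the loop; `ask()` is a hypothetical helper you supply around whatever chat API you use (it keeps the shared conversation history and returns the reply text), and the prompts simply paraphrase the ones above.

```python
# A rough sketch of the "PUA" loop: one model judges others' answers,
# answers itself, self-scores, and is pushed to revise.
from typing import Callable

def pua_loop(task: str, rival_answers: list[str],
             ask: Callable[[str], str], rounds: int = 2) -> str:
    # Have the judge model design a rubric for the task.
    ask(f"I'm working on this task: {task}\n"
        "You are the judge: design a 100-point grading rubric for it.")
    # Score each rival answer against the rubric.
    for i, answer in enumerate(rival_answers, 1):
        ask(f"Assistant {i}'s answer:\n{answer}\n"
            "Score it against your rubric and explain your reasons in detail.")
    # Turn the judge into a contestant.
    best = ask("Since you understand the task so well, answer it yourself.")
    # Push it to self-score and revise for a few rounds.
    for _ in range(rounds):
        ask("Score your own answer with the rubric you just made, and explain why.")
        best = ask("Where did you lose the remaining points? Think it over and revise.")
    return best
```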

Of course, whether the final output earns a perfect score doesn’t really matter. What matters is that along the way many new ideas and angles emerge, which is genuinely enlightening for us humans.

The method is simple enough; in the end, it was probably “deeply inspired” by my junior-high math teacher, whose high-pressure reflective teaching once put me off mathematics for a while.

Fortunately, today’s AI has no feelings and won’t push back, so we are free to use a “PUA tone” to squeeze out its intellectual potential.

End of text