A Required Course for AI Product Managers: How to Improve the Certainty of AI Agent Output?

AI agents generate content that is highly uncertain. How can we make their output more deterministic? This article takes a close look at the stability of AI-generated content, analyzes the key factors that affect agent output, and offers practical strategies to help product managers optimize the AI interaction experience and make agents' answers more accurate and reliable.

AI agents paint an exciting picture of the future for us. But beneath this bright prospect, a major challenge is quietly emerging: can we trust these seemingly intelligent agents 100%? Are the answers they give really reliable?

1. Remember when software was “one is one, two is two”? The “iron law era” of code

Remember the software we used to rely on, like Word, Excel, or games with clear rules? They belong to the “Software 1.0” era, and they share one defining trait: every action follows an instruction, and every result can be predicted.

Like a precision clock, every gear and every hand runs strictly according to preset rules. Developers work like architects, building the entire program brick by brick (line by line of code). Build a calculator and enter “2+2”, and it will always answer “4”. Enter a customer’s order information, and it generates the order without error: not one misplaced decimal point, not one extra zero.

The core task of many programs is data processing. A banking system, for example, handles your deposits and withdrawals, and as long as the algorithm is flawless, the accounts will never be wrong. Given correct instructions (algorithms) and the same input data, the output is always the same, rock solid!

Software in this era could be slow to develop, like building a huge castle out of blocks, demanding patience and meticulousness. It was not very “smart” either: it did exactly what you asked and nothing more, drawing no inferences of its own. But its biggest advantage was high certainty.

2. The arrival of large AI models: “the strongest brain” or a “free-spirited improviser”?

Then the AI era arrived, led by large language models (LLMs) such as ChatGPT. It is as if we suddenly gained a super brain that has read thousands of books (massive amounts of data) and can write poetry, draw, chat, and program. “AI agents” are built on top of these large models, in the hope of making software as intelligent as humans.

But a problem comes with it: this “super brain” sometimes likes to improvise.

Ask it the same question now and again a few minutes later, and it may give you slightly different answers. That would be unthinkable in the Software 1.0 era. Traditional software is built on explicit “if… then… else…” rules. The internal decision-making of large AI models, especially LLMs, is more like an extremely complex “black box” that generally lacks interpretability. They generate content by learning patterns and associations from massive amounts of data, and that process is inherently probabilistic.

Software 1.0 is like a recipe: follow the steps and ingredients exactly, and the dish tastes the same every time. An AI model is more like an experienced chef who likes to improvise: order “fish-flavored shredded pork”, and he may add extra vinegar today and extra chili tomorrow. The dish may taste good, but it is never quite the same.

Why does AI “roll the dice”? Simply put, when generating each word, an LLM does not choose the one “correct” word; it picks from a pool of candidate words according to their probabilities, much as the same meaning can be phrased many different ways when we speak.

This “uncertainty” is an innate characteristic of AI agents.

3. What are the tricks to improve certainty?

Faced with this improvisational streak, we need to do our best to put reins on this “wild horse”, improve its certainty, and make it more “obedient”. Four methods are commonly used:

  1. Prompt Engineering: “Speak clearly and specifically!” As when communicating with a person, the clearer and more specific your words, the more accurately the other party understands you. The same holds for AI: carefully designing the way you ask (the prompt) guides the AI toward more accurate, expected answers. For example, don’t just say “write a story”; say “write a short adventure story about a puppy looking for a bone, in a lively, playful style, about 500 words”.
  2. Few-shot Learning: “Look, like this!” When asking, give the AI a few examples of correct answers to imitate. For example, if you want the AI to classify emails, show it a few samples (“this is spam”, “this is an important email”) so it learns your classification criteria; see the prompt sketch after this list.
  3. Fine-tuning: “Special training, because every trade has its specialists!” Continue training a general-purpose large model on data from a specific field so it becomes an expert in that field. For example, a model fine-tuned on a large corpus of legal documents can draft contracts and analyze cases more professionally and stably.
  4. Retrieval-Augmented Generation (RAG): “Look it up before you speak!” Before the AI answers a question, have it retrieve relevant information from a reliable knowledge base (such as internal company documents or authoritative textbooks) and compose the answer from that material. This greatly reduces the chance of the AI “talking nonsense”.
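
To make method 2 concrete, here is a minimal sketch of building a few-shot prompt for email classification in Python. The example emails, labels, and message format are illustrative assumptions; the resulting `messages` list can be passed to any chat-completion-style API.

```python
def build_few_shot_messages(email_text: str) -> list[dict]:
    """Assemble a few-shot email-classification prompt.

    The sample emails and labels below are made up for illustration;
    swap in real examples from your own data.
    """
    examples = [
        ("Congratulations! You won a free cruise, click here!", "spam"),
        ("Hi, the Q3 budget review is moved to Friday 10am.", "important"),
    ]
    messages = [{"role": "system",
                 "content": "Classify each email as 'spam' or 'important'. Reply with one word."}]
    for text, label in examples:               # the few "shots"
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": email_text})
    return messages

msgs = build_few_shot_messages("Limited-time offer: 90% off luxury watches!!!")
# Pass `msgs` to your provider's chat-completion endpoint, ideally with temperature=0.
```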

These methods improve the stability and reliability of AI output to a degree, pushing its certainty from “occasionally reliable” toward “basically usable”.

But these common methods are clearly not enough when we pursue higher certainty. Below are nine techniques you may not know in depth; I will try to explain their principles in plain language:

1. Temperature Scaling

Technical principle:

When an LLM generates text, it is actually predicting the next most likely word (or token). It calculates a probability distribution for all words in the vocabulary. The temperature parameter is like a knob that adjusts the “shape” of this probability distribution.

  • Low temperature (e.g. 0.1-0.5): makes high-probability words even more probable and low-probability words less so, sharpening the distribution. The model leans strongly toward the words it considers most likely.
  • High temperature (e.g. 0.8-1.0 and above): shrinks the probability gaps between words, flattening the distribution. The model becomes more likely to pick less common words, increasing the randomness and creativity of the output.

How to improve certainty:

Set the temperature very low (e.g. close to 0). The model will then almost always pick the highest-probability word at each generation step, so the output sequence for the same input tends to be fixed, greatly improving determinism. Many APIs allow setting temperature=0 to get as close to deterministic output as possible.

Simple analogy: imagine drawing balls from a raffle box full of options. Low temperature is like making the most likely winning ball extra big and heavy, so you grab it almost every time.
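
As a concrete illustration, here is a minimal Python sketch (using NumPy and a made-up three-token vocabulary, not a real model) of how temperature reshapes the next-token distribution before sampling:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from logits after temperature scaling.

    A toy sketch of the principle above; `logits` are pretend scores,
    not output from a real model.
    """
    rng = rng or np.random.default_rng()
    if temperature <= 0:              # treat T=0 as greedy decoding
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # numerical stability before softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]              # pretend scores for 3 candidate tokens
print(sample_with_temperature(logits, temperature=0.1))  # almost always 0
print(sample_with_temperature(logits, temperature=1.5))  # more varied
```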

2. Top-K Sampling

Technical principle:

When the LLM selects the next word, Top-K sampling sorts all candidate words by probability from highest to lowest and samples only from the K most probable ones (usually combined with a temperature parameter). For example, with K=5 the model picks the next word from only the five most probable candidates.

How to improve certainty:

By limiting the candidate pool, Top-K prevents the model from picking from the extremely low-probability “long tail” of the vocabulary, which often produces irrelevant or strange output. A small K combined with low temperature greatly narrows the model’s choices and improves output certainty. With K=1 it becomes greedy search: always choose the single most probable word, for maximum certainty.

Simple analogy:

Back to the raffle box: Top-K sampling means you may only draw from the K top prizes labeled “1st prize”, “2nd prize”, “3rd prize”, and so on, rather than drawing at random from all the prizes.
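
In code, the idea is just “sort, truncate, renormalize, sample”. A toy sketch, with `probs` standing in for a model’s next-token distribution:

```python
import numpy as np

def top_k_sample(probs, k=5, rng=None):
    """Sample a token index from only the k highest-probability candidates."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]            # indices of the k most likely tokens
    kept = probs[top] / probs[top].sum()    # renormalize within the shortlist
    return int(rng.choice(top, p=kept))

probs = [0.5, 0.2, 0.15, 0.1, 0.04, 0.01]   # made-up next-token distribution
print(top_k_sample(probs, k=3))              # only indices 0, 1, 2 are possible
```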

3. Top-P Sampling (Nucleus Sampling)

Technical principle:

Top-P sampling is more flexible than Top-K. Instead of a fixed number K of words, it chooses the smallest set of words whose probabilities add up to a threshold P (e.g., P=0.9). That is, the model collects words from highest to lowest probability until their cumulative probability exceeds P, then samples from this dynamically sized set (the “nucleus”).

How to improve certainty:

With a low P (e.g., P=0.1 or lower), the nucleus becomes very small, perhaps just one or two words, sharply reducing the model’s options and increasing output certainty. Its advantage over Top-K: when the model is very sure of the next word (one word’s probability far exceeds the rest), the nucleus automatically shrinks; when the model is less sure (several words are similarly probable), the nucleus grows accordingly, while still excluding the mass of low-probability words.

Simple analogy:

At a buffet, Top-P sampling is like the waiter telling you: “Take whatever you want, but only from the most popular dishes that together account for P (say 90%) of what diners pick.” If P is small, you can only choose from a handful of the most popular signature dishes.
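
A toy sketch of the same idea in code; note that the nucleus size is computed from the cumulative probability rather than fixed in advance (the `probs` array is made up):

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Nucleus sampling sketch: keep the smallest high-probability set
    whose cumulative probability reaches p, then sample within it."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # most likely first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # dynamic nucleus size
    nucleus = order[:cutoff]
    kept = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=kept))

probs = [0.6, 0.25, 0.1, 0.04, 0.01]
print(top_p_sample(probs, p=0.5))   # nucleus here is just index 0
```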

4. Beam Search

Technical principle:

Unlike greedy search, which keeps only the single best word at each step, beam search keeps the B most probable candidate sequences at every step of generation (B is the “beam width”, also called beam size). At the next step it extends each of those B sequences and again keeps the B new sequences with the highest total probability. Finally, it outputs the complete sequence with the highest probability among the B.

How to improve certainty:

By exploring a broader search space, beam search is more likely to find globally optimal or near-optimal sequences instead of settling for local optima. A larger beam width B tends to produce higher-quality, more coherent text. Although it is not designed for strict determinism per se (ties between equally probable sequences may break differently), it shows higher “de facto certainty” than random sampling methods: for the same input and model state it is more likely to converge to similar high-quality outputs. With B=1, beam search degenerates into greedy search.

Simple analogy:

In chess, greedy search looks only at the single best next move. Beam search keeps the B most promising lines of play in view at the same time, then chooses the path that looks best overall.
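
Below is a toy beam search sketch. The `fake_model` function is a hypothetical stand-in for a real language model’s next-token log-probabilities; in practice each step would condition on the prefix:

```python
import math

def beam_search(step_logprobs, beam_width=3, steps=4):
    """Keep the beam_width highest-scoring sequences at every step."""
    beams = [([], 0.0)]                      # (token sequence, total logprob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # prune: keep only the beam_width best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]                          # best complete sequence

def fake_model(prefix):
    # Hypothetical stand-in for a real LM (ignores the prefix for simplicity).
    vocab = {"the": 0.5, "cat": 0.3, "sat": 0.2}
    return {t: math.log(p) for t, p in vocab.items()}

print(beam_search(fake_model, beam_width=2, steps=3))
```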

5. Constrained Decoding / Guided Generation

Technical principle:

This technique forces the LLM’s output to conform to predefined rules, patterns, or vocabularies at every step of text generation. The constraints can be:

  • Syntax constraints: for example, the output must be valid JSON or XML, or conform to the grammar of a particular programming language.
  • Lexical constraints: the model may only choose words from a specific subset of the vocabulary.
  • Regular-expression constraints: the output must match a given regular expression.
  • Semantic constraints: for example, the output must contain certain keywords, or must not contain certain sensitive words.

How to improve certainty:

External constraints drastically shrink the LLM’s room for “free play”. If the constraints are strong enough, such as requiring a JSON object with a fixed structure, the model keeps some flexibility in filling in values, but the overall structure is guaranteed. This is crucial for agentic tasks that need output in a specific format, such as generating API call parameters.

Simple analogy:

Like a crossword puzzle: you must not only think of a word with the right meaning, but also make it fit the grid and agree with the intersecting words across and down.
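
As a minimal sketch of the lexical-constraint case: mask out all disallowed tokens before choosing. Grammar-based constrained decoders work similarly but recompute the allowed set after every emitted token (the logits here are toy values):

```python
import numpy as np

def constrained_greedy_step(logits, allowed_ids):
    """One decoding step under a lexical constraint: tokens outside
    `allowed_ids` are masked to -inf before the argmax."""
    logits = np.asarray(logits, dtype=float)
    masked = np.full(logits.shape, -np.inf)
    idx = list(allowed_ids)
    masked[idx] = logits[idx]          # only permitted tokens survive
    return int(np.argmax(masked))

logits = [1.2, 3.4, 0.5, 2.9]          # toy scores for a 4-token vocabulary
print(constrained_greedy_step(logits, allowed_ids={0, 2}))  # -> 0, not 1
```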

6. Output Caching

Technical principle:

For the exact same input (prompt), the system can cache the first output from the LLM (once verified or accepted by users). When the identical input arrives again, the system returns the cached output directly, without calling the LLM at all.

How to improve certainty:

This is the simplest and most direct way to achieve “absolute certainty”, though only for inputs that repeat exactly. It is a very effective strategy for common queries that require stable results.

Simple analogy:

You ask a friend a question and he answers you. The next time you ask the same question, he will directly repeat the answer from last time.
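
A minimal caching sketch; `llm_call` is a placeholder for whatever function actually queries your model:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_llm_call(prompt: str, llm_call) -> str:
    """Return a cached answer for an identical prompt, else call the model."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = llm_call(prompt)   # first time: ask the model
    return _cache[key]                   # afterwards: replay the stored answer

print(cached_llm_call("What is 2+2?", lambda p: "4"))            # model called
print(cached_llm_call("What is 2+2?", lambda p: "never called")) # cache hit: "4"
```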

7. Output Parsing & Validation/Repair

Technical principle:

After the LLM generates output, the system uses a parser to try to convert it into structured data (e.g., JSON, objects). A validator then checks whether the structured data conforms to a predefined schema or business rules. If parsing or validation fails, the system can:

  • Ask the LLM to regenerate: optionally feeding the error message back so the LLM can correct itself.
  • Attempt an automatic repair: for some minor errors, the system may be able to fix the output itself.

How to improve certainty:

This does not make the LLM’s raw output deterministic, but it does guarantee that whatever reaches downstream processes conforms to the required format and basic business rules, improving the “effective determinism” and robustness of the whole system.

Simple analogy:

You hand in an assignment, and the teacher (the parser and validator) checks the format and hunts for errors; if it fails, the assignment is sent back for a rewrite, or the teacher fixes the minor mistakes for you.
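
Here is a minimal parse-validate-retry loop. The required "answer" key is an assumed schema for illustration, and `llm_call` again stands in for a real model call:

```python
import json

def parse_validate_retry(llm_call, prompt, max_retries=2):
    """Parse the model's output as JSON, validate a required field,
    and retry with the error message fed back on failure."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = llm_call(message)
        try:
            data = json.loads(raw)                   # parse step
            if "answer" not in data:                 # validate step
                raise ValueError('missing required key "answer"')
            return data                              # passed both checks
        except (json.JSONDecodeError, ValueError) as err:
            # feed the error back so the model can correct itself
            message = f"{prompt}\nYour last reply was invalid ({err}); return JSON with an \"answer\" key."
    raise RuntimeError("no valid output after retries")

print(parse_validate_retry(lambda m: '{"answer": 4}', "What is 2+2? Reply as JSON."))
```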

8. Iterative Refinement & Self-Critique

Technical principle:

Have the LLM generate a preliminary answer, then let it (or another LLM instance) critically evaluate that answer, pointing out problems, gaps, or rule violations. Based on these critiques, the original LLM revises and refines the answer. The process can be iterated multiple times.

How to improve certainty:

Through multiple rounds of iteration and self-correction, the output gradually converges to a more consistent, higher-quality result. Each individual LLM call remains random, but the iteration as a whole acts like a negative feedback loop that “pulls” the result back onto the desired track, showing greater stability and consistency at the macro level.

Simple analogy:

Imagine writing an essay: after the first draft, you read it yourself, find problems, and revise; then read and revise again until you are satisfied.
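
A minimal sketch of the loop; `generate` and `critique` are placeholders for two LLM calls (or two prompts to the same model), and using "LGTM" as the critic’s stop signal is just a convention assumed for this example:

```python
def refine(generate, critique, prompt, rounds=3):
    """Generate-critique-revise loop sketch."""
    draft = generate(prompt)
    for _ in range(rounds):
        feedback = critique(draft)           # ask for problems in the draft
        if feedback.strip() == "LGTM":       # critic is satisfied: stop early
            break
        draft = generate(f"{prompt}\nPrevious draft:\n{draft}\nFix these issues:\n{feedback}")
    return draft

# Toy stand-ins for the two model calls:
print(refine(lambda p: "draft answer", lambda d: "LGTM", "Explain beam search."))
```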

9. Hybrid Systems (Rule-Based + LLM)

Technical principle:

Combine deterministic, rule-based traditional software modules with the cognitive abilities of LLMs. Parts of the task that explicit rules can handle are implemented in traditional code to guarantee certainty; parts that require understanding natural language, complex reasoning, or creative generation call the LLM.

How to improve certainty:

By drawing clear task boundaries and confining LLMs to the areas where they excel and where uncertainty can be tolerated or managed, the system’s core logic and critical data processing remain the responsibility of the deterministic modules. This greatly reduces the risk of system-wide failures caused by LLM uncertainty.

Simple analogy:

Think of a robot chef: precise tasks like cutting and weighing are done by the robotic arm (the rule module) so every gram is exact, while inventing dishes and pairing flavors draws on suggestions from the AI brain (the LLM module).
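
A minimal routing sketch: deterministic code handles what rules can handle (here, toy integer addition), and only the rest falls through to the LLM (`llm_call` is a placeholder):

```python
import re

def handle_request(text: str, llm_call) -> str:
    """Hybrid routing sketch: deterministic rules first, LLM as fallback.

    The arithmetic pattern is an assumption for illustration; real systems
    typically route on intent classification.
    """
    # Rule module: exact arithmetic is handled by deterministic code.
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*", text)
    if match:
        return str(int(match.group(1)) + int(match.group(2)))
    # LLM module: open-ended language tasks go to the model.
    return llm_call(text)

print(handle_request("2 + 2", lambda t: "(LLM answer)"))           # -> "4", no LLM
print(handle_request("Suggest a dish name", lambda t: "(LLM answer)"))
```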

These techniques are rarely used in isolation; they are combined according to the application scenario and the degree of certainty required. For example, low temperature, Top-P sampling, and output parsing and validation can be used together to strike the best balance.

In conclusion, while matching traditional software’s 100% predictability in every case remains a huge challenge, the combined use of these techniques already lets us control and guide LLM behavior to a large extent, making outputs far more reliable and consistent on specific tasks. Future research will keep pushing in this direction in search of better solutions.

4. The road ahead: building AI agents with “superpowers” that still stand “as steady as Mount Tai”

AI agents will profoundly change our work and lives. But the hurdle of “certainty” is one we must clear on the way to that better future.

This does not mean stifling AI’s creativity and flexibility. The key lies in applying it scenario by scenario and controlling the risks.

  • For creative, exploratory tasks, such as writing poetry, painting, or brainstorming, AI’s “uncertainty” is an advantage that can bring pleasant surprises.
  • For serious, high-precision tasks, such as in medicine, finance, and automatic control, we need tighter “reins”, perhaps even a double-insurance mechanism of “AI + human review”, or new AI architectures with stronger deterministic logic.

The future development of AI is likely to follow a path where “rules and probability dance together”.

We must not only use the powerful induction and generation capabilities of large models, but also find ways to embed more structured knowledge and logical reasoning, so that AI can be “down-to-earth” while being “imaginative”.

Perhaps, we will see more hybrid AI systems emerge, which cleverly combine the rigorous logic of traditional software with the cognitive intelligence of AI large models.

The ultimate goal is to cultivate AI agents that can not only solve complex problems with “clever calculation” but also guarantee reliable results.
