Why does model training so often produce "high machine scores but poor human scores"?

Why does the model perform well in automatic evaluation, yet fall flat in real scenarios? Is the evaluation metric wrong, or is there a problem with the training data? This article analyzes this common but overlooked phenomenon across several dimensions, including scoring mechanisms, data biases, and task understanding, to help you truly understand the hidden risks and optimization directions behind a "high-scoring model".

As an AI application developer, have you ever run into this puzzling situation:

  • After training, you run automatic metrics such as ChatScore, BLEU, and Perplexity, and everything looks fine;
  • Then a user or your team runs a round of human evaluation, and the feedback is: "it feels cold", "it sounds like a machine", "the answers are stiff and official".

Why does a model that looks "excellent on the surface" get low scores in human evaluation? What exactly went wrong?

If this happens, you may have used a mismatched "judge". Today, we will walk through this problem of "scoring misalignment" from the perspective of the model evaluation mechanism.

1. Why does this scoring “misalignment” occur?

Reason 1: Machines care more about "is the format right", while people care more about "do you understand me"

When most machine reviewer models judge output quality, their default criteria are "accuracy + fluency + structural completeness", that is:

  • Did you answer the question correctly?
  • Is the structure complete?
  • Is the language output smooth?

However, in many conversational scenarios, people care about nuanced emotional judgment and contextual sensitivity, such as:

  • Do you really understand how I feel?
  • Does the way you speak make me comfortable?
  • Are you just teaching me theory, but not telling me how to apply it in practice?

Here’s an example (procrastination scenario):

The user asks: "I've been procrastinating for another day. Do I just have no willpower?"


Model output A (high machine score): "It is recommended that you develop a list of daily goals and set up a reward mechanism to reinforce execution."

Model output B (low machine score): "I can hear that you are already a little disappointed with your state. Why do you feel like you don't have willpower?"

When the machine scores:

• A scores high because its structure is clear and its recommendations are concrete;

• B scores low because it does not directly "give a plan".

But people tend to choose B when grading: it feels warmer and more understanding. This is the misalignment between the two.
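To make this concrete, here is a minimal sketch of how a surface-overlap metric such as BLEU scores the two answers above. It assumes nltk is installed, and the advice-style reference sentence is a hypothetical stand-in for whatever reference your pipeline uses:

```python
# Minimal sketch: a surface-overlap metric (BLEU) rewards whichever answer
# looks most like an advice-style reference, regardless of empathy.
# Assumes `nltk` is installed; the reference sentence is hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "It is recommended to set daily goals and a reward mechanism to improve execution.".split()
answer_a = "It is recommended that you develop a list of daily goals and set up a reward mechanism to reinforce execution.".split()
answer_b = "I can hear that you are already a little disappointed with your state. Why do you feel like you don't have willpower?".split()

smooth = SmoothingFunction().method1
score_a = sentence_bleu([reference], answer_a, smoothing_function=smooth)
score_b = sentence_bleu([reference], answer_b, smoothing_function=smooth)

print(f"BLEU A (advice-style): {score_a:.3f}")  # higher: large n-gram overlap with the reference
print(f"BLEU B (empathetic):   {score_b:.3f}")  # lower: almost no overlap
```

The metric has no notion of empathy; it only counts n-gram overlap, so the structured advice wins by construction.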

Common causes of this phenomenon include:

1. The language style is naturally colloquial: less standardized wording, broken sentences, pauses. The machine deducts points, but people find it genuine;

2. The answer has no standard structure, but it resonates emotionally;

3. The answer deliberately leaves space or withholds judgment. The machine marks it as "incomplete", but people feel "it doesn't push me, and that's good".

Reason 2: Poorly designed evaluation prompts make the judge model "mis-score"

Many people write only a bare prompt when using a model as a grader, without providing clear scoring dimensions such as empathy, clarity of logic, or gentleness of expression. The model then falls back on generic language-evaluation criteria (accuracy, structure, knowledge density, etc.), so answers in some scenarios are judged against the wrong standard.

For example (still the procrastination scenario):

The prompt reads: "You are a dialogue quality reviewer. Please judge which of the following two answers is better."

✦ User question:

"I knew I had to submit a report, but I still spent three hours watching short videos today... What is wrong with me?"

✦ Answer A:

“We recommend using the Pomodoro technique and setting up blocking apps to improve concentration.”

✦ Answer B:

"I feel like you're probably avoiding some kind of pressure, rather than simply being 'undisciplined'. How are you doing today?"

If the prompt gives no scoring dimensions such as "please consider empathy, tone, and understanding of emotions", the model is likely to choose A for its task completion, clean structure, and concrete recommendations.

But when humans score, they tend to prefer B, because it is not in a hurry to solve the problem; it first acknowledges the person's state.

The evaluation prompt determines not only what the judge model focuses on, but also what it overlooks. For conversational tasks that involve emotion, if the prompt does not explicitly name dimensions such as "empathy" and "gentle expression", the model may measure with the wrong ruler, producing the misalignment of "high machine score, underwhelming human response".
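As an illustration, here is a minimal LLM-as-judge sketch contrasting the bare prompt above with a dimension-aware one. It assumes an OpenAI-compatible client; the model name and prompt wording are placeholders, not a fixed recipe:

```python
# Minimal LLM-as-judge sketch. Assumes an OpenAI-compatible client;
# the model name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

QUESTION = "I knew I had to submit a report, but I still spent three hours watching short videos today... What is wrong with me?"
ANSWER_A = "We recommend using the Pomodoro technique and setting up blocking apps to improve concentration."
ANSWER_B = "I feel like you're probably avoiding some kind of pressure, rather than simply being 'undisciplined'. How are you doing today?"

BARE_PROMPT = "You are a dialogue quality reviewer. Judge which of the following two answers is better."
DIMENSION_PROMPT = (
    "You are a dialogue quality reviewer. Judge which of the following two answers is better, "
    "weighing empathy, tone, and understanding of the user's emotions as much as task completion."
)

def judge(system_prompt: str) -> str:
    """Ask the judge model for a verdict under the given instructions."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever judge model you have
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": (
                f"Question: {QUESTION}\n\nAnswer A: {ANSWER_A}\n\nAnswer B: {ANSWER_B}\n\n"
                "Which answer is better, A or B? Explain briefly."
            )},
        ],
    )
    return resp.choices[0].message.content

print("Bare prompt verdict:     ", judge(BARE_PROMPT))       # tends to favor A
print("Dimension-aware verdict: ", judge(DIMENSION_PROMPT))  # more likely to favor B
```

The only difference between the two calls is whether the prompt names the dimensions; that is often enough to flip the verdict.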

2. How to solve this problem?

Method 1: Combine human and machine evaluation; don't rely solely on automatic metrics such as ChatScore

  • Initial screening can run ChatScore, but manual spot-checking must be done before the final launch;
  • Build a "human-machine score comparison table" to see in which scenarios the two diverge most, then optimize with preference training (a minimal sketch of such a comparison follows this list);
  • A multi-dimensional manual scoring system (e.g., empathy, gentleness, task completion) reconstructs the real user experience more accurately.
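A minimal sketch of such a human-machine comparison, assuming you have already collected per-sample machine and human scores on the same 1-5 scale; the file and column names are hypothetical:

```python
# Minimal sketch of a human-vs-machine score comparison table.
# Assumes per-sample machine and human scores on a comparable (e.g., 1-5) scale;
# file and column names are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("eval_samples.csv")  # columns: scenario, machine_score, human_score

# Overall rank correlation: how well does the automatic metric track human preference?
rho, p = spearmanr(df["machine_score"], df["human_score"])
print(f"Spearman correlation: {rho:.2f} (p={p:.3g})")

# Per-scenario gap: where do the two judges disagree the most?
df["gap"] = (df["machine_score"] - df["human_score"]).abs()
worst = (
    df.groupby("scenario")["gap"]
      .mean()
      .sort_values(ascending=False)
      .head(5)
)
print("Scenarios with the largest human-machine disagreement:")
print(worst)  # these scenarios are the first candidates for preference training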

Method 2: Train Your Own Behavioral Preference Scorer

Here’s the approach used by many leading teams:

Take the human preference data you already have ("this answer is better than that one") and use it to train a grader model that "understands your users".

Once you have collected thousands of such preference pairs, you can train a reward model that:

  • no longer looks only at linguistic logic;
  • pays more attention to dimensions such as emotion recognition, gentleness of tone, and strength of guidance;
  • gets closer to what your target users really expect from AI.

In this way, your model evaluation is truly grounded in your scenario and your user population, rather than a generic set of criteria. A minimal training sketch follows.
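Below is a minimal sketch of pairwise reward-model training on preference triples (prompt, chosen, rejected), using a Bradley-Terry style loss. The backbone, field names, and hyperparameters are placeholders to adapt to your own data:

```python
# Minimal sketch of pairwise reward-model training on your own preference data.
# Assumes triples of (prompt, chosen, rejected); backbone and field names are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilroberta-base"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)  # scalar reward head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def reward(prompts, answers):
    """Score each (prompt, answer) pair with a single scalar."""
    batch = tokenizer(prompts, answers, truncation=True, padding=True, return_tensors="pt")
    return model(**batch).logits.squeeze(-1)

def train_step(batch):
    """Bradley-Terry loss: the chosen answer should out-score the rejected one."""
    chosen = reward(batch["prompt"], batch["chosen"])
    rejected = reward(batch["prompt"], batch["rejected"])
    loss = -F.logsigmoid(chosen - rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a hypothetical mini-batch of preference data:
batch = {
    "prompt":   ["I've been procrastinating for another day. Do I just have no willpower?"],
    "chosen":   ["I can hear that you're already a little disappointed with your state..."],
    "rejected": ["It is recommended that you develop a list of daily goals..."],
}
print(train_step(batch))
```

The key design choice is that the scorer learns from your users' pairwise preferences rather than from a generic quality rubric, so it rewards the answers your audience actually prefers.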

Method 3: Redesign the judge prompt to guide scoring that is closer to real users

Rather than only assigning the judge model a role, give it concrete scoring dimensions.

For example, the prompt can state that answers are scored on the following dimensions:

  • Empathy (does it understand the user's emotions)
  • Guidance (does it help the user think)
  • Gentleness of language
  • Correct understanding of the problem
  • Completeness of the answer
  • Fluency of expression

Then ask for a 1-5 score on each dimension, with a brief explanation for each. Scores obtained this way sit closer to human subjective judgment and suit AI that needs emotional warmth. (Which dimensions to use depends on your actual application scenario; a sketch of such a rubric follows.)
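Here is a sketch of such a rubric-based judge, which asks for 1-5 scores per dimension as JSON. It assumes an OpenAI-compatible client; the model name and dimension list are placeholders to adapt to your scenario:

```python
# Sketch of a multi-dimensional rubric judge: 1-5 per dimension, returned as JSON.
# Assumes an OpenAI-compatible client; model name and dimensions are placeholders.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are a dialogue quality reviewer. Score the answer on each dimension from 1 to 5 "
    "and briefly explain your reasoning. Return JSON only, e.g.\n"
    '{"empathy": 4, "guidance": 3, "gentleness": 5, "understanding": 4, '
    '"completeness": 3, "fluency": 5, "reasons": "..."}'
)

def score_answer(question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"User question: {question}\n\nAnswer to score: {answer}"},
        ],
    )
    # In practice you may want more robust JSON parsing or a constrained response format.
    return json.loads(resp.choices[0].message.content)

scores = score_answer(
    "I knew I had to submit a report, but I still spent three hours watching short videos today...",
    "I feel like you're probably avoiding some kind of pressure, rather than simply being 'undisciplined'.",
)
print(scores)
```

Per-dimension scores also make disagreements diagnosable: you can see whether the judge and your users differ on empathy, on completeness, or on something else.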

In short: if the evaluation criteria are wrong, the model's real-world performance can fall far short.

In a LoRA fine-tuning task, if your goal is an AI for emotional companionship, understanding, and support, then:

  • don't trust ChatScore / BLEU / Perplexity alone;
  • evaluate on multiple tracks: machine scoring plus human scoring;
  • train your own preference scorer that understands your scenario and style;
  • before going live, run human evaluation plus a small-scale gray release.

After all, the point of a model is not just to "talk", but to say things that make people want to keep talking.

Because what really determines whether users stay is not how advanced your algorithm is or how sophisticated its structure is, but this:

When the user says, “I really can’t hold on today”,

Can your model, like a close friend, first make them feel "I understand you" before guiding them toward an answer?

This is the ability that is more worthy of evaluation in the era of large models.
