Building a large-scale AI recommendation system from 0: How to define an effectiveness evaluation system?

In today’s digital age, building large-scale AI recommendation systems has become an important direction for many products. However, measuring the true value of a recommendation system and using it to drive business growth remains a significant challenge for product managers. This article walks through how to build a recommendation-system effectiveness evaluation framework from scratch.

A core and ongoing challenge for product managers responsible for large-scale AI recommendation systems is how to establish an evaluation system that truly measures the system’s value and effectively drives business development. Such a system cannot stop at technical indicators; it must be deeply integrated with the product’s core strategy, so that it both guides the algorithm team’s optimization direction and clearly demonstrates the recommendation system’s contribution to business goals.

This requires product managers to move beyond fixation on any single metric and develop a deep understanding of the complex, dynamic interplay among technical performance, user experience, and business objectives. By carefully designing a multi-dimensional indicator matrix to evaluate value comprehensively, building a clear indicator mapping chain to keep optimization pointed in the right direction, relying on a rigorous AB test platform for scientific decision-making, and using a “North Star + guardrail” metric combination to maintain a healthy balance, the evaluation system becomes a powerful engine driving the continuous evolution of the recommendation system.

1. Build a multi-dimensional indicator matrix

The value of a recommendation system is multi-faceted, and single-dimensional evaluation is prone to bias. We need to establish a multi-dimensional indicator system that covers every layer from the technical foundation to the user experience.

1. Basic technical indicators

Accuracy metrics: these measure the core capability of the recommendation system. Commonly used ones include:

  • Precision: of the items recommended to a user, what share is the user genuinely interested in (e.g., clicks, purchases, views)?
  • Recall: of the items a user may be interested in, what share does the system successfully recommend?
  • F1 score: the harmonic mean of precision and recall, balancing the two.
  • Root Mean Square Error (RMSE): in rating-prediction scenarios such as movie ratings, measures how far predicted values deviate from actual values.
  • Practical considerations: these metrics must be defined for the specific scenario. In e-commerce, “interested” is often defined as a purchase; on a content platform, it may be an effective read or view. You need to clarify the criteria for a “positive sample” (what counts as user interest) and watch how data sparsity affects the calculations. A minimal calculation sketch follows below.
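
To make these definitions concrete, here is a minimal sketch of the offline calculations, assuming `recommended` is the set of items shown to one user and `relevant` is that user’s set of positive samples; the function names are illustrative, not a standard library API:

```python
# Minimal sketch (illustrative names): precision, recall, F1, and RMSE
# for a single user's recommendation list.
import math

def precision_recall_f1(recommended: set, relevant: set):
    """Precision/recall/F1 where `relevant` holds the user's positive samples
    (items actually clicked, purchased, or watched)."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def rmse(predicted, actual):
    """Root mean square error between predicted and observed ratings."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

# Example: 2 of the 5 recommended items were actually engaged with.
print(precision_recall_f1({"a", "b", "c", "d", "e"}, {"b", "d", "f"}))
print(rmse([4.2, 3.5, 2.8], [4.0, 3.0, 3.5]))
```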

Diversity indicators: key to preventing filter bubbles (information cocoons) and expanding the space users can explore.

  • Category coverage: how many different content/product categories do the recommendation results cover? For example, does a general-purpose video platform’s feed reasonably include film and TV, variety shows, documentaries, popular science, and other categories?
  • Distribution balance measures (e.g., Shannon entropy): calculate how recommendation results are distributed across categories or topics. The higher the entropy, the more evenly the recommended content is spread, and the less a single category (or a few categories) monopolizes the results.
  • Practical strategy: set clear diversity monitoring thresholds. For example, you can require that the top 3 most popular categories account for no more than a preset share of the recommendation list (such as 60%); once the threshold is triggered, the system should automatically adjust the strategy or raise an alert so algorithm engineers can tune the diversity weights. A small sketch of both measures follows below.
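
A minimal sketch of both diversity measures, assuming each recommended item already carries a category label; the 60% cap simply mirrors the example above:

```python
# Minimal sketch: category entropy and a top-3 concentration check for one
# recommendation list. The 0.60 threshold mirrors the example in the text.
import math
from collections import Counter

def category_entropy(categories: list[str]) -> float:
    """Shannon entropy of the category distribution (higher = more diverse)."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def top3_share(categories: list[str]) -> float:
    """Combined share of the three most frequent categories."""
    counts = Counter(categories)
    return sum(c for _, c in counts.most_common(3)) / len(categories)

recs = ["drama", "drama", "variety", "documentary", "drama", "drama", "science"]
print(f"entropy: {category_entropy(recs):.2f}, top-3 share: {top3_share(recs):.0%}")
if top3_share(recs) > 0.60:
    print("Diversity threshold triggered: adjust diversity weights or alert.")
```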

2. User experience metrics

Novelty indicators: measure the system’s ability to help users discover new things.

  • Quantification: the percentage of recommended items the user has never interacted with historically (never clicked, purchased, or played).
  • Balancing novelty and relevance: novelty cannot come at the expense of relevance. In practice, filter candidates by the item’s estimated click-through rate (CTR) or relevance score. For example, you can prioritize items that are novel to the current user and whose estimated CTR exceeds a baseline, and avoid recommending content the user has no interest in just for the sake of “newness”. A sketch of this combination follows below.
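
A minimal sketch of the two points above, assuming estimated CTRs are available from the ranking model; the item IDs, `pctr` values, and the 0.02 baseline are illustrative assumptions:

```python
# Minimal sketch: novelty ratio plus a relevance floor on novel candidates.

def novelty_ratio(recommended: list[str], interacted: set[str]) -> float:
    """Share of recommended items the user has never interacted with."""
    if not recommended:
        return 0.0
    return sum(item not in interacted for item in recommended) / len(recommended)

def novel_and_relevant(candidates, interacted, pctr, baseline=0.02):
    """Keep novel items only if their estimated CTR clears the baseline,
    so novelty is not pursued at the expense of relevance."""
    return [i for i in candidates if i not in interacted and pctr.get(i, 0.0) >= baseline]

history = {"sku_1", "sku_2"}
recs = ["sku_1", "sku_9", "sku_7"]
pctr = {"sku_9": 0.035, "sku_7": 0.004}
print(novelty_ratio(recs, history))             # 2 of 3 items are novel
print(novel_and_relevant(recs, history, pctr))  # only sku_9 survives the CTR floor
```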

Surprise (serendipity) indicator: measures whether the system can deliver valuable recommendations that exceed user expectations.

  • Assessment challenges: surprise is difficult to quantify directly and usually requires a combination of qualitative feedback and indirect behavioral signals.
  • Qualitative approaches: user research (e.g., asking users, “Have any recent recommendations pleasantly surprised you?”) and focus-group discussions.
  • Behavioral signals: watch for users’ “high-value” behaviors after receiving specific recommendations, such as sharing, favoriting, and reading/viewing time significantly above average. An unusual lift in these behaviors may signal that a surprise recommendation has landed; mechanisms are needed to identify and track these signals.
  • Key point: surprise is not the same as novelty. An item the user has never seen may simply be unpopular and low quality; a surprise recommendation is usually outside the user’s regular interests yet triggers positive feedback because of its high quality or unique value.

2. Establish a clear indicator mapping

The improvement of technical indicators must ultimately serve business goals. One of a product manager’s core responsibilities is to build and continuously validate the pathway of “algorithm optimization → user behavior change → business outcome improvement”.

1. Build a conversion link model

For each core technical metric, map out the chain of user behaviors it influences and the business outcomes those behaviors drive, for example: exposure → click (CTR) → detail page view → order/conversion → GMV.

2. In-depth analysis and monitoring of indicator mapping

Forward transmission verification: after the algorithm team optimizes a metric (e.g., CTR), the product manager needs to closely track changes in downstream behavioral metrics (e.g., detail page views) and final business metrics (e.g., GMV). For example, a CTR lift should in theory bring more users to the detail page; if the detail page’s conversion rate stays stable, you should eventually see increases in order volume and GMV. Build a data dashboard that clearly shows the trend of each step along this link and the correlations between them. A simple funnel sketch follows below.
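
A minimal sketch of such a funnel view, computed from daily aggregate counts; the field names and numbers are illustrative rather than tied to any particular analytics stack:

```python
# Minimal sketch: stage-by-stage rates along the
# "algorithm -> behavior -> business" link, from daily aggregates.

def funnel_report(impressions, clicks, detail_views, orders, gmv):
    """Rates that let a CTR lift be traced through to GMV."""
    return {
        "CTR": clicks / impressions,
        "click -> detail rate": detail_views / clicks,  # do clicks reach detail pages?
        "detail -> order CVR": orders / detail_views,   # does the detail page convert?
        "AOV": gmv / orders,                            # average order value
        "GMV": gmv,
    }

# Compare the funnel before vs. after an algorithm change: if CTR rises while
# the detail -> order CVR holds steady, GMV should rise; a falling CVR points
# to a break in the link (e.g., clickbait-style recommendations).
print(funnel_report(1_000_000, 42_000, 30_000, 1_500, 180_000.0))
```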

Link break diagnosis: when algorithm-layer metrics improve but business metrics fall short of expectations or even decline, dig into the intermediate user-behavior layer. For example:

  • CTR rises but GMV stagnates: check whether the detail page’s bounce rate is increasing and whether time spent on the detail page has dropped significantly. This could mean that while the recommended content attracts clicks (e.g., via a catchy title), the actual content (product listing, video) does not match user expectations or needs, so the conversion fails.
  • CTR rises but add-to-cart/favorite rate falls: analyze the attributes of the recommended items (e.g., does the price band deviate from the target user’s mainstream spending range? Is the category too niche or unsuited to the user’s current scenario?). In pursuit of clicks, the algorithm may be recommending items users are merely “curious” about but have little real intention to buy or spend on.

Incorporate long-term value indicators: avoid letting the algorithm fall into the trap of optimizing for short-term clicks. Metrics that reflect long-term user value need to be included in the evaluation system (a retention calculation is sketched after this list), such as:

  • User retention rate (next-day/7-day/30-day): does the recommendation system effectively retain users?
  • Repeat interaction/purchase rate: do users continue to engage with or repurchase recommended content?
  • Share of high-value content/product recommendations: does the system effectively guide users toward the high-quality or high-margin content the platform wants to promote?
  • User satisfaction (NPS/surveys): how do users subjectively feel about the recommendation results? Regularly collecting user feedback is crucial.
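
A minimal retention-rate sketch, assuming raw activity logs that map each user to the set of dates on which they were active; the data and function name are illustrative:

```python
# Minimal sketch: next-day and 7-day retention from per-user activity dates.
from datetime import date, timedelta

def retention_rate(activity: dict, cohort_day: date, n_days: int) -> float:
    """Share of users active on cohort_day who are active again n_days later."""
    cohort = {u for u, days in activity.items() if cohort_day in days}
    if not cohort:
        return 0.0
    target = cohort_day + timedelta(days=n_days)
    return sum(target in activity[u] for u in cohort) / len(cohort)

logs = {
    "u1": {date(2024, 5, 1), date(2024, 5, 2), date(2024, 5, 8)},
    "u2": {date(2024, 5, 1)},
    "u3": {date(2024, 5, 1), date(2024, 5, 8)},
}
print(retention_rate(logs, date(2024, 5, 1), 1))  # next-day retention: 1/3
print(retention_rate(logs, date(2024, 5, 1), 7))  # 7-day retention: 2/3
```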

3. Build a robust AB test platform

Relying on intuition and experience alone is extremely risky when optimizing a complex recommendation system. AB testing is the core infrastructure for verifying strategy effectiveness and making decisions scientifically.

1. Essential core modules of the AB test platform

1) Flexible and reliable traffic-splitting system:

Core capabilities: the ability to stratify and randomly split users along multiple dimensions (user attributes such as new vs. returning, activity level, membership tier; access device such as iOS/Android app, Web, H5; region, etc.).

Practical details: splitting rules must be clearly defined in advance and kept stable, so that the experimental and control groups have uniform, comparable characteristics. The split ratio (e.g., 5% of traffic to experimental group A, 5% to experimental group B, and 90% to the control group) needs to be flexibly configurable. The system must keep user grouping stable across experiments and across time periods (this matters especially for experiments measuring long-term user stickiness). A hash-based assignment sketch follows below.
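
A minimal sketch of deterministic, hash-based assignment, so the same user always lands in the same group for a given experiment; the 5/5/90 split mirrors the example above and the experiment name is a made-up placeholder:

```python
# Minimal sketch: stable, hash-based traffic splitting for an AB test.
import hashlib

GROUPS = [("exp_A", 0.05), ("exp_B", 0.05), ("control", 0.90)]

def assign_group(user_id: str, experiment: str = "ranking_v2") -> str:
    """Map user_id to a bucket in [0, 1] and then to a group, deterministically."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, share in GROUPS:
        cumulative += share
        if bucket < cumulative:
            return name
    return GROUPS[-1][0]  # edge case: bucket == 1.0 falls into the last group

print(assign_group("user_12345"))  # same user + same experiment -> same group
```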

2) Real-time comprehensive data monitoring center:

Core capabilities: collect and display, in real time (or near real time), the differences in performance between the experimental group and the control group on core metrics.

Key Indicators:

  • Basic traffic metrics: PV (page views), UV (unique visitors).
  • Core conversion metrics: click-through rate (CTR), conversion rate (CVR), purchase rate, play completion rate, etc.
  • User experience metrics: page load time, app stuttering rate, and error rate.

Early-warning mechanism: set fluctuation thresholds on key metrics (e.g., the experimental group’s CTR drops by more than 10% relative to the control group), trigger alert notifications automatically, and configure a strategy rollback mechanism. A minimal threshold check is sketched below.
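
A minimal sketch of such a threshold check; the 10% relative drop mirrors the example above, and the rollback action itself is only hinted at:

```python
# Minimal sketch: guardrail alert when experiment CTR drops >10% vs. control.

def ctr_guardrail_hit(ctr_control: float, ctr_experiment: float,
                      max_relative_drop: float = 0.10) -> bool:
    """True if the experiment's relative CTR drop exceeds the threshold."""
    if ctr_control <= 0:
        return False
    return (ctr_control - ctr_experiment) / ctr_control > max_relative_drop

if ctr_guardrail_hit(ctr_control=0.045, ctr_experiment=0.039):
    print("ALERT: experiment CTR down >10% vs. control; consider rollback.")
```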

3) Rigorous and scientific effect evaluation engine:

Core capabilities: built-in standard statistical significance tests (e.g., a t-test for continuous variables such as duration or spend; a chi-square test for proportion variables such as CTR and CVR), with the p-value calculated automatically to judge whether experimental results are statistically significant.

Report Generation: Automatically output test reports containing key information such as core indicator comparisons, significance results, and confidence intervals.

Special scenario handling: for low-frequency but critical events (e.g., high-value purchases, paid-membership conversions), longer test periods, larger sample sizes, or Bayesian statistical methods may be needed to make conclusions credible with small samples. A sketch of the two standard tests follows below.
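
A minimal sketch of the two standard tests using SciPy (`scipy.stats`), with made-up click counts and session durations; a common, though not universal, convention is to treat p < 0.05 as statistically significant:

```python
# Minimal sketch: chi-square test for CTR (a proportion) and Welch's t-test
# for a continuous metric (session duration). All numbers are made up.
from scipy import stats

# Chi-square: [clicks, non-clicks] for experiment and control groups.
clicks = [[1_150, 48_850],    # experiment: 1,150 clicks out of 50,000
          [1_020, 48_980]]    # control:    1,020 clicks out of 50,000
chi2, p_ctr, dof, expected = stats.chi2_contingency(clicks)

# Welch's t-test on per-user session duration (seconds).
exp_duration = [310, 295, 402, 288, 350, 330, 290, 410]
ctl_duration = [280, 300, 275, 260, 320, 290, 285, 270]
t_stat, p_duration = stats.ttest_ind(exp_duration, ctl_duration, equal_var=False)

print(f"CTR chi-square p-value: {p_ctr:.4f}")
print(f"Duration t-test p-value: {p_duration:.4f}")
```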

2. Key principles of AB test design and execution

Single-variable principle: try to change only one strategy variable per experiment (e.g., adjust only the ranking weights, change only the recall policy, or update only the candidate-pool filtering rules). If multiple variations must be tested together, design orthogonal experiments or use more complex designs such as multivariate experiments, and interpret the results with caution.

Guarantee a full test cycle: testing must cover a long enough window of user behavior to capture the strategy’s long-term effects and cyclical fluctuations. For example:

  • E-commerce must include weekdays, weekends and possible promotional cycles.
  • Content platforms need to consider the peak and trough periods of user activity.
  • Educational products need to consider the impact of special periods such as the start of the semester, exam week, and holidays. Avoid making false judgments due to short-term fluctuations before the cycle is complete.

Establish anti-cheating and data-cleaning mechanisms: identify and filter abnormal behavior (such as crawler traffic, malicious click farming, and data generated by employee test accounts) to ensure the experimental data is authentic and representative. Clear rules for anomalous behavior and data-cleaning procedures need to be defined.

4. Design a dynamic balance system of “North Star + guardrail” metrics

To ensure the recommendation system does not stray from healthy development while pursuing its core goals, adopt a combined management strategy of “North Star metric + guardrail metrics”.

1. Anchor the North Star indicator

Definition principles:

  • It must directly reflect the product’s core value and definition of success (user growth? user retention? monetization efficiency? ecosystem health?).
  • It must be significantly influenced by the recommendation system’s optimization strategies.
  • It must be a quantifiable, trackable, high-level business metric.

Typical examples:

  • Content consumption platform: the average daily and weekly usage time of users, and the total content playback/reading volume.
  • E-commerce platform: total turnover (GMV), total platform revenue.
  • User growth products: Daily active users (DAU), monthly active users (MAU).
  • Tool products: Usage rate of core functions (e.g., “recommended content saving/citation rate” in the Notes app).

Key point: the whole team (product, algorithm, operations) needs to reach consensus on the North Star indicator to ensure resources are invested in the same direction.

2. Set guardrail indicators

Function: monitor possible negative side effects of recommendation-system optimization to prevent damage to user experience or the platform ecosystem in pursuit of the North Star metric.

Common types of guardrail indicators:

1) Content/commodity ecological health:

  • Long-tail content/product coverage: the proportion of non-head (e.g., outside the Top 1000) content/products in the recommendation results. For example, set a rule that “non-popular items must make up at least 30% of the recommendation list” to prevent the Matthew effect from intensifying and to guarantee exposure for small and mid-sized creators/merchants (see the sketch after this list).
  • Content quality monitoring: use technical means (such as NLP models that identify clickbait titles, low-quality duplicate content, and misinformation), possibly combined with manual review, to monitor the share of low-quality content in the recommendation pool, and set thresholds for early warning or intervention.
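
A minimal sketch of the long-tail coverage check; the Top-1000 head set and the 30% floor mirror the example above:

```python
# Minimal sketch: checking that non-head items make up >=30% of a recommendation list.

def long_tail_share(recommended_ids: list[str], head_ids: set[str]) -> float:
    """Share of recommended items that fall outside the popularity head set."""
    if not recommended_ids:
        return 0.0
    return sum(i not in head_ids for i in recommended_ids) / len(recommended_ids)

head = {f"item_{i}" for i in range(1000)}   # stand-in for the Top 1000 popular items
recs = ["item_3", "item_42", "item_1500", "item_7", "item_2048"]
share = long_tail_share(recs, head)
print(f"long-tail share: {share:.0%}")
if share < 0.30:
    print("Guardrail hit: increase long-tail exposure or alert the algorithm team.")
```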

2) User health:

  • Churn rate: after a strategy change, pay special attention to churn during the new-user activation period (e.g., 7-day churn of new users) and the retention period of existing users (e.g., 30-day churn of existing users). An abnormal rise in churn while the North Star indicator improves is a major risk signal.
  • Negative user feedback: the rate at which users report, complain about, or mark recommended content as “not interested”.

3) Technical experience guarantee:

Recommendation loading latency, API error rate, etc.

3. Achieve dynamic balance

Establish an indicator correlation model: understand the relationship between the North Star indicator and the key guardrail indicators.

For example, you can try a decomposition such as: North Star metric (e.g., GMV) = head product/content contribution × W1 + long-tail product/content contribution × W2, where W1 and W2 are weights set according to business strategy (e.g., W1 = 0.6, W2 = 0.4). Adjusting the weights guides the algorithm to balance short-term efficiency against long-term ecosystem health. A small sketch of this decomposition follows below.
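
A minimal sketch of this weighted decomposition; the weights mirror the example, and the contribution figures are made up (in practice they would come from GMV attributed to head vs. long-tail items):

```python
# Minimal sketch: blending head and long-tail contributions with strategy weights.

W_HEAD, W_TAIL = 0.6, 0.4   # weights set by business strategy

def weighted_north_star(head_contribution: float, tail_contribution: float) -> float:
    """North Star score that balances short-term efficiency and long-term ecology."""
    return W_HEAD * head_contribution + W_TAIL * tail_contribution

print(weighted_north_star(head_contribution=800_000, tail_contribution=250_000))
```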

Continuous monitoring and tuning: “North Star + guardrail” is not static. Product managers need to continuously monitor all key indicators; when a guardrail indicator hits its warning line, even if the North Star indicator is performing well, the strategy should be paused, the cause analyzed, and adjustments made. The balance point must be re-tuned over time according to the product’s stage of development, the competitive environment, user feedback, and so on.

5. Phased implementation roadmap

Building an evaluation system is a gradual process that needs to match the maturity of the recommendation system:

1. 0→1 phase (Cold Start & MVP Validation)

Focus: Quickly build core evaluation capabilities.

Actions: Define and monitor a small set of the most critical metrics (e.g., CTR, core conversion rate, next-day retention of new users).

Key: Use basic AB testing capabilities to quickly verify whether the core assumptions of the recommendation strategy hold (e.g., are collaborative-filtering recommendations more effective than popularity-based recommendations?), and ensure the system is basically usable and delivers positive value.

2. 1→10 phase (Scaling & Rapid Iteration)

Focus: Enrich the evaluation dimensions and establish an efficient iterative closed loop.

Actions:

  • Introduce user experience metrics such as diversity and novelty.
  • Refine the “algorithm → behavior → business” indicator mapping and build a data dashboard.
  • Establish a regular (e.g., weekly) data alignment mechanism for product, algorithm, and data teams to jointly analyze changes in indicators and determine optimization priorities.
  • Strengthen AB test platform capabilities to support more complex experimental designs and faster iteration.

Key: Ensure that the evaluation system keeps up with the rapid iteration of the business and algorithms, and that data insights effectively guide decision-making.

3. 10→N phase (Ecosystem & Refined Operations)

Focus: Build a comprehensive health monitoring and long-term value evaluation system.

Actions:

  • Establish a complete “North Star + guardrail” indicator combination with clear monitoring thresholds and response mechanisms.
  • Develop a recommendation system health assessment model, which may integrate technical indicators, user experience indicators, ecological indicators, and user satisfaction (such as NPS) to form a comprehensive score or dashboard.
  • Deeply analyze the correlation between long-term user behavior (e.g., retention curve, LTV prediction) and recommendation strategies.
  • Explore more forward-looking assessment methods, such as causal inference analysis of long-term strategy impact.

Key: Ensure that while pursuing efficiency, the recommendation system maintains a healthy ecosystem, user satisfaction, and sustainable business growth.
