In today’s era, where the AI wave is sweeping across every industry, recommendation systems are moving from behind the scenes to the forefront, becoming the core driver of user experience. This article walks through the entire process of building a large-scale AI recommendation system from scratch, revealing the key path by which a real-time engine evolves from a single tool into a complex ecosystem. Whether you are a product manager, a technical practitioner, or an explorer curious about AI system architecture, this article offers front-line practical experience and systematic thinking.
Building a real-time engine that supports a large-scale AI recommendation system is the key to improving user experience and business results. This is not merely a tooling upgrade but the evolution of the entire data processing, model training, and serving architecture into a real-time, intelligent ecosystem. Here is a breakdown of the core path:
1. Scenario design for real-time recommendations
The core of real-time recommendations lies in understanding the differences between scenarios and responding precisely to each:
Information feed scenarios
Challenge: User behavior is highly fragmented (rapid swiping, short dwell times), and interests shift quickly.
Core objective: Identify interest shifts and adjust the content feed within milliseconds.
Key technical points:
1) Dynamic feature fusion:
- Integrate user behavior in real time (clicks, play-completion rate, skip rate)
- Dynamic content indicators (recent like/comment growth rate, CTR)
- Contextual information (current time of day, geolocation, network status)
- Construct high-dimensional real-time feature vectors
2) Layered, efficient recall and ranking (see the sketch after this list):
- Recall/coarse ranking: Use lightweight models (such as ANN/HNSW-based approximate nearest neighbor search) or efficient rules (such as real-time interest-tag matching) to quickly filter hundreds to thousands of relevant items out of a massive candidate pool; response time is strictly kept within milliseconds.
- Fine ranking: Use complex deep models (such as DIN, the Deep Interest Network, and DIEN, the Deep Interest Evolution Network) to score and rank the coarse results per user, capturing the evolution of user interest in detail.
- Re-ranking: On top of the fine-ranking results, introduce strategies such as diversity (covering different categories/topics), novelty (exposure control), and business rules (operational slots, monetization strategies) to optimize the final presented sequence for both user experience and platform goals.
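To make the layered structure concrete, here is a minimal Python sketch of the recall → fine ranking → re-ranking cascade. The `ann_index.search` and `model.predict` interfaces and the simple diversity rule are stand-ins for illustration, not a production implementation.

```python
def coarse_recall(ann_index, user_vector, k=500):
    """Recall/coarse ranking: approximate nearest-neighbor lookup over the item pool.
    `ann_index` stands in for an HNSW/ANN index (e.g., hnswlib or faiss)."""
    return ann_index.search(user_vector, k)  # -> list of candidate item ids

def fine_rank(model, user_features, candidates):
    """Fine ranking: score every candidate with a deep model (DIN/DIEN in the text)."""
    scores = model.predict(user_features, candidates)
    return sorted(zip(candidates, scores), key=lambda pair: -pair[1])

def re_rank(ranked, category_of, max_per_category=2, top_n=20):
    """Re-ranking: enforce simple category diversity on top of fine-ranking scores."""
    result, per_category = [], {}
    for item, _score in ranked:
        cat = category_of(item)
        if per_category.get(cat, 0) < max_per_category:
            result.append(item)
            per_category[cat] = per_category.get(cat, 0) + 1
        if len(result) == top_n:
            break
    return result
```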
Shopping cart / bundling scenarios
Challenge: Users have clear purchase intent; the goal is to raise average order value and the cross-sell (associated purchase) rate.
Core objective: Deliver high-converting recommendations based on the user’s current intent.
Key technical points:
1) Scenario-based bundle recommendation engine:
- Combine strong product-association rules (frequent-itemset mining over historical orders/behaviors, graph relationship learning) with individual user preferences and historical paths.
- Build a layered bundling strategy of “core product + strongly related accessories + potential-interest recommendations”. Strategy weights can be adjusted dynamically based on real-time signals such as the user’s purchase behavior and page dwell time.
2) Real-time inventory and business status awareness:
- Integrate in real time with the inventory management system (IMS) and the promotion system.
- When the real-time inventory of a recommended product falls below the safety threshold, or its promotion status changes (e.g., a limited-time discount ends), the recommendation engine must complete candidate replacement (choosing a comparable high-inventory or high-availability product) within a very short time (milliseconds to seconds), as sketched below.
- A front-end UI feedback mechanism is also needed (such as low-stock prompts and dynamically updated promotion labels) so that users perceive the results as current.
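As a minimal sketch of the candidate-replacement logic above, assuming hypothetical `get_stock` and `similar_in_stock` hooks into the IMS and the recall service:

```python
SAFETY_THRESHOLD = 10  # assumed safety-stock threshold; tune per business

def ensure_available(candidates, get_stock, similar_in_stock):
    """Replace low-stock candidates with comparable in-stock items.

    get_stock(item_id) -> int                     (real-time IMS lookup)
    similar_in_stock(item_id) -> item_id or None  (comparable high-stock item)
    """
    result = []
    for item in candidates:
        if get_stock(item) >= SAFETY_THRESHOLD:
            result.append(item)
        else:
            substitute = similar_in_stock(item)
            if substitute is not None:
                result.append(substitute)
            # If no substitute exists, drop the item rather than
            # recommend something that may already be out of stock.
    return result
```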
2. Build a low-latency streaming pipeline
Streaming computing is the lifeblood of a real-time recommendation engine. It must meet three core requirements: low latency (milliseconds to seconds), high throughput (millions of events per second), and elastic scalability.
Data ingestion layer
Multi-source heterogeneous data integration: Use high-throughput message queues (Kafka, Pulsar) to ingest user behavior logs (clicks, browsing, add-to-cart, purchases), business events (product delisting, price/inventory changes, campaign launches), and third-party data streams (real-time weather, traffic, and public-opinion events).
Real-time data cleaning and standardization (see the sketch below):
- Define and enforce strict dirty-data filtering policies (handling duplicate logs, anomalous device IDs, format errors, etc.).
- Implement data masking (e.g., one-way hashing of user IDs, masking of sensitive fields).
- Establish real-time field mapping and conversion rules (such as mapping product IDs to category trees and geocodes to business districts).
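A minimal sketch of this ingestion step, using the kafka-python client to consume behavior events, drop records with missing required fields, and one-way-hash the user ID; the topic name and event schema are assumptions:

```python
import hashlib
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "user-behavior-events",                  # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def clean_and_mask(event):
    """Drop dirty records and mask the user ID with a one-way hash."""
    if not event.get("user_id") or not event.get("item_id"):
        return None                           # dirty data: required field missing
    event["user_id"] = hashlib.sha256(event["user_id"].encode()).hexdigest()
    return event

for message in consumer:
    event = clean_and_mask(message.value)
    if event is not None:
        # Forward to the real-time compute layer (e.g., produce to a clean topic).
        print(event)
```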
Real-time compute layer
Definition and computation of core real-time metrics:
- Real-time user activity: Over sliding time windows (e.g., 5 minutes, 1 hour), count the frequency of user behavior (number of clicks, interaction duration) or compute more complex aggregates (session depth).
- Dynamic content/product popularity: Use algorithms such as EWMA (exponentially weighted moving average) on the growth rate of recent interactions (likes, favorites, purchases) to reflect instantaneous changes in popularity (a sketch follows this list).
- Scenario context weighting: Dynamically adjust the strategy weights or feature combinations of the recall and ranking models based on the page the user is currently visiting (homepage feed, search results page, product detail page, shopping cart page).
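A minimal sketch of the first two metrics: the EWMA update score_t = α·x_t + (1−α)·score_{t−1} for item popularity, and a sliding-window click count for user activity. The smoothing factor and window size are illustrative.

```python
import time
from collections import defaultdict, deque

ALPHA = 0.3  # illustrative smoothing factor; closer to 1 reacts faster

popularity = defaultdict(float)

def update_popularity(item_id, interactions_this_tick):
    """EWMA update: new = alpha * observed + (1 - alpha) * old."""
    popularity[item_id] = (
        ALPHA * interactions_this_tick + (1 - ALPHA) * popularity[item_id]
    )

# Sliding-window user activity: count clicks in the last 5 minutes.
WINDOW_SECONDS = 300
user_clicks = defaultdict(deque)

def record_click(user_id, now=None):
    now = now or time.time()
    q = user_clicks[user_id]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:  # evict expired events
        q.popleft()
    return len(q)                             # real-time activity metric
```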
Real-time feature engineering as a platform:
- Provide configurable languages (such as SQL-like or XL-Formula expressions) to define complex statistical features (such as “the number of products in a specific third-level category viewed by the user in the past hour” or “the share of clicks on similar products in the last 30 minutes”); see the sketch below.
- Support aggregate computations over time windows and event sequences (count, sum, distinct count, max/min).
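What such configurable definitions might look like is sketched below as plain Python dicts that a feature platform could compile into streaming jobs; every field name here is invented for illustration and does not reflect any specific product’s DSL.

```python
# Hypothetical declarative feature definitions that a feature platform
# could compile into streaming (e.g., Flink) jobs.
FEATURES = [
    {
        "name": "cat3_views_1h",          # views of one third-level category, past hour
        "source": "user-behavior-events",
        "filter": "event_type == 'view'",
        "group_by": ["user_id", "category_level3"],
        "aggregation": "count",
        "window": {"type": "sliding", "size": "1h", "slide": "1m"},
    },
    {
        "name": "similar_item_click_ratio_30m",
        "source": "user-behavior-events",
        "group_by": ["user_id"],
        "aggregation": "ratio",           # clicks on similar items / all clicks
        "window": {"type": "sliding", "size": "30m", "slide": "1m"},
    },
]
```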
Event-driven response mechanism: When real-time computation detects that user behavior matches preset rules (such as clicking the same product category N times in a row, or completing a high-value conversion within a short period), it can immediately trigger model parameter fine-tuning, recall-strategy switching, or operational intervention, as sketched below.
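A minimal sketch of one such rule: N same-category clicks inside a short window fire a callback that could switch the recall strategy. The threshold, window, and `switch_recall_strategy` hook are assumptions.

```python
import time
from collections import defaultdict, deque

N_CLICKS = 5   # illustrative rule: 5 same-category clicks...
WINDOW = 120   # ...within 2 minutes

recent = defaultdict(deque)  # (user_id, category) -> click timestamps

def on_click(user_id, category, switch_recall_strategy, now=None):
    now = now or time.time()
    q = recent[(user_id, category)]
    q.append(now)
    while q and q[0] < now - WINDOW:   # keep only clicks inside the window
        q.popleft()
    if len(q) >= N_CLICKS:
        # Rule matched: immediately bias recall toward this category.
        switch_recall_strategy(user_id, boost_category=category)
        q.clear()
```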
Output and serving layer
High-performance service interfaces: Clearly define the input/output format of the recommendation API (JSON/Protobuf), the mandatory response-time SLA (e.g., P99 < 200ms), and the supported peak QPS (e.g., 500,000+).
Robust fault tolerance and degradation:
- When the real-time compute service or downstream dependencies (feature stores, model services) fail or exhibit high latency, automatically and seamlessly switch to precomputed offline recommendation results or a popularity list (see the sketch below).
- Integrate monitoring systems such as Prometheus and Grafana to track key metrics in real time: API error rate, per-stage processing latency, system resource load, and the share of traffic served by the fallback path.
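A minimal sketch of the degradation path: call the real-time service under a tight timeout and fall back to cached offline results on any failure, counting fallback traffic for the monitoring system. The `realtime_service` and `offline_cache` interfaces are assumed.

```python
import logging

logger = logging.getLogger("recs")
fallback_count = 0  # would be exported as the fallback-traffic metric

def recommend(user_id, realtime_service, offline_cache, timeout_ms=150):
    """Serve real-time results; degrade to precomputed offline results on failure."""
    global fallback_count
    try:
        return realtime_service.recommend(user_id, timeout_ms=timeout_ms)
    except Exception as exc:  # timeout, connection error, model failure...
        fallback_count += 1
        logger.warning("realtime recs failed for %s: %s", user_id, exc)
        # Per-user offline list if cached, else the global popularity list.
        return offline_cache.get(user_id) or offline_cache.get("hot_list")
```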
3. Online learning system
Online learning is at the core of real-time responsiveness in recommendations; it must balance high-frequency model updates against the stability of online serving.
Incremental learning framework design
High-value sample selection: Use a priority-queue mechanism so that recent, high-conversion-value user behavior samples (e.g., purchases, deep interactions) enter the training process sooner, combined with a time-decay factor that down-weights older samples (see the sketch below).
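A minimal sketch of such a prioritized sample queue, with illustrative base priorities per event type and an exponential time decay (a one-hour half-life is assumed):

```python
import heapq
import itertools
import math
import time

BASE_PRIORITY = {"purchase": 10.0, "add_to_cart": 5.0, "click": 1.0}
HALF_LIFE_S = 3600.0  # illustrative: a sample's weight halves every hour

def sample_weight(event_type, event_ts, now=None):
    now = now or time.time()
    decay = math.exp(-math.log(2) * (now - event_ts) / HALF_LIFE_S)
    return BASE_PRIORITY.get(event_type, 0.5) * decay

class SampleQueue:
    """Max-priority queue feeding the incremental trainer."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tiebreaker so samples are never compared

    def push(self, sample, event_type, event_ts):
        # heapq is a min-heap, so negate the weight for max-priority behavior.
        w = sample_weight(event_type, event_ts)
        heapq.heappush(self._heap, (-w, next(self._seq), sample))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```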
Reliable parameter updates and deployments:
- Asynchronous training-inference decoupling: The training process is deployed independently, separated from the online inference service. Incremental, near-real-time synchronization of model parameters (seconds to minutes) is achieved through a parameter server or shared storage (such as Redis or a distributed file system).
- Strict version control and rollback: Keep snapshots of historical model versions. Before any new version goes live, it must pass rigorous A/B or interleaving testing to verify that the gains on core metrics (CTR, GMV, etc.) come with no negative impact before full rollout, and a second-level rollback mechanism must be supported (sketched below).
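A minimal sketch of versioned parameter synchronization through Redis, where rollback is just moving a version pointer; the key names and pickle serialization are assumptions for illustration.

```python
import pickle
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def publish_model(params, version):
    """Trainer side: store a versioned snapshot, then flip the pointer."""
    r.set(f"model:params:{version}", pickle.dumps(params))
    r.set("model:current_version", version)

def load_current_model():
    """Inference side: poll the pointer (seconds-level) and load the params."""
    version = r.get("model:current_version").decode()
    return version, pickle.loads(r.get(f"model:params:{version}"))

def rollback(to_version):
    """Second-level rollback: move the pointer back to a known-good snapshot."""
    r.set("model:current_version", to_version)
```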
Building a real-time feedback closed loop
Real-time reflow of behavioral data: Signals such as exposures, clicks, and conversions (add-to-cart, purchases) on recommended results must be fed back to the training system in real time (within seconds), forming a closed loop of “recommendation → user feedback → model optimization”. This is the fuel that lets the model adapt quickly to change.
Handling the cold-start problem:
- New users: Make rapid initial recommendations from static characteristics such as device information, initial geolocation, and acquisition channel, combined with the behavior patterns of similar user groups (based on demographics or content attributes). The model must be able to absorb initial behavior quickly.
- New items/content: Leverage preset metadata rules (based on content tags and publisher information) and lightweight real-time collaborative filtering (based on the content’s own similarity or its association with existing popular items) to quickly build item feature vectors and bring them into the recommendation candidate pool, as sketched below.
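A minimal sketch of bootstrapping a new item’s vector from its metadata tags by averaging tag embeddings learned from existing items; the `tag_embeddings` table is an assumed input.

```python
import numpy as np

def bootstrap_item_vector(tags, tag_embeddings, dim=64):
    """Cold start for a new item: average the embeddings of its metadata tags.

    tag_embeddings: dict mapping tag -> np.ndarray of shape (dim,),
    assumed to be learned from existing items.
    """
    vectors = [tag_embeddings[t] for t in tags if t in tag_embeddings]
    if not vectors:
        return np.zeros(dim)          # no known tags: fall back to a neutral vector
    return np.mean(vectors, axis=0)   # enters the ANN candidate pool immediately
```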
High availability architecture and resource management
Hybrid inference deployment: For the extremely latency-sensitive fine-ranking/re-ranking stages, lightweight models (LR logistic regression, FM factorization machines) can be deployed to edge nodes/CDNs, while complex deep models run in the central GPU cluster. The lightweight models’ quality is improved with techniques such as model distillation.
Elastic resource scheduling: Dynamically scale compute resources (e.g., via K8s HPA) based on real-time traffic prediction (using historical models or simple time-series models) and system monitoring metrics (CPU, memory, GPU utilization). Use solutions such as Tencent Oceanus or Flink Native K8s for automated resource provisioning, so that service level objectives (SLOs) hold during traffic peaks.
4. Circuit breaker mechanism
Circuit breaking is the core line of defense for the overall availability of the recommendation system; it must deliver accurate detection, rapid response, and orderly degradation.
Intelligent circuit-breaker triggering
Multi-dimensional monitoring index system:
- Service performance: P99/P95 response time persistently exceeds the threshold (e.g., > 500ms for 1 minute), or the API error rate spikes (e.g., > 15% for 5 minutes).
- Resource bottlenecks: GPU memory utilization > 85%, CPU load > 90%, or a high risk of out-of-memory failures.
- Downstream dependency health: Failures or high latency in key downstream dependencies such as feature storage, databases, and model services.
Dynamic threshold adjustment: Based on business-cycle characteristics (e.g., traffic surges during major promotions are the norm), use a baseline prediction model to adjust circuit-breaker thresholds dynamically and avoid false triggers under tolerable normal business fluctuations.
Circuit breaker state machine and recovery process
- Closed: Normal service; continuous monitoring.
- Open: When the breaker trips, immediately cut off traffic to the faulty component and return preset offline/fallback results. Start the cool-down timer (e.g., 5 minutes).
- Half-open: After the cool-down, allow a small number of probe requests (e.g., 1-10% of total traffic) through. If the success rate meets the bar (e.g., > 90%), close the breaker; otherwise, reset the timer and return to the open state (see the sketch below).
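A minimal sketch of this state machine, using the cool-down, probe ratio, and success bar from the text as illustrative defaults:

```python
import random
import time

class CircuitBreaker:
    """Closed -> Open -> Half-open state machine for a downstream dependency."""

    def __init__(self, cooldown_s=300, probe_ratio=0.05, success_bar=0.9):
        self.state = "closed"
        self.cooldown_s = cooldown_s
        self.probe_ratio = probe_ratio
        self.success_bar = success_bar
        self.opened_at = 0.0
        self.probe_ok = 0
        self.probe_total = 0

    def trip(self):
        """Called by monitoring when error rate / latency exceeds thresholds."""
        self.state, self.opened_at = "open", time.time()

    def allow_request(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"   # cool-down elapsed: start probing
                self.probe_ok = self.probe_total = 0
            else:
                return False               # serve fallback results instead
        # Half-open: let a small share of traffic probe the dependency.
        return random.random() < self.probe_ratio

    def record(self, success):
        if self.state != "half_open":
            return
        self.probe_total += 1
        self.probe_ok += success
        if self.probe_total >= 20:         # illustrative probe sample size
            if self.probe_ok / self.probe_total >= self.success_bar:
                self.state = "closed"
            else:
                self.trip()                # reset the timer, back to open
```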
Refined degradation strategy
- Feature classification and degradation: Clearly prioritize recommendation features (core: main feed; sub-core: related recommendations; non-core: personalized pop-ups/ads). When the breaker trips, degrade from lowest to highest priority.
- Fallback coverage: Pre-generate and cache lists of popular/high-quality content (e.g., Top-N items/content) from offline computation, ensuring that users still see relevant, basically usable results when degraded.
- Transparent user communication: Show concise status cues (such as “Recommendations loading” or “Service being optimized”) in appropriate places on the client, such as recommendation placeholders, to manage user expectations and reduce frustration.
5. Build a real-time recommendation ecosystem
The essence of building a large-scale real-time AI recommendation engine is to drive deep collaboration and continuous evolution across three systems: data, algorithms, and engineering:
- Data layer: Streaming capability is the foundation; the goal is to transform raw data in real time into high-value features (“data as features”) that drive recommendations.
- Algorithm layer: Combining online learning with offline batch training and reinforcement learning gives the model the ability to continuously self-optimize and keep pace with business dynamics.
- Engineering layer: Architectural designs such as circuit breaking, elastic scaling, and hybrid deployment build rock-solid system stability while pursuing extreme real-time performance.
Looking ahead, device-edge-cloud collaborative computing will grow increasingly important: lightweight real-time inference and preliminary feature extraction run on edge devices, while complex model training and global optimization stay in the cloud. Combined with technologies such as federated learning, this allows value to be mined from a wider range of data while protecting user privacy, pushing real-time recommendation toward smarter and more secure systems.