RAG (Retrieval-Augmented Generation), a paradigm that blends information retrieval with natural language generation, is reshaping how AI is applied to knowledge-intensive tasks. This article analyzes RAG's origins, its development path, and the architecture and evolution of RAG in modern AI systems.
RAG (Retrieval-Augmented Generation) enhances the capabilities of large language models (LLMs) by integrating external knowledge sources. We trace RAG's lineage back to two foundational fields, information retrieval (IR) and natural language generation (NLG), through the early explorations of Open-Domain Question Answering (ODQA), up to the moment the framework was formally named in 2020. We then dissect the architecture of modern RAG, detailing its core components and its evolution from naive to advanced and modular RAG, and close with some thoughts on the emergence of agentic, adaptive, and multimodal RAG.
Chapter 1: Hybrid AI, the basic predecessor of RAG
Retrieval-Augmented Generation (RAG) did not appear out of nowhere. It is the product of the fusion of two independent fields with long histories, Information Retrieval (IR) and Natural Language Generation (NLG), catalyzed by specific technological breakthroughs.
1.1 Twin Pillars: The History of Information Retrieval (IR) and Natural Language Generation (NLG)
The core of RAG lies in the combination of "retrieval" and "generation", each of which comes from a well-established branch of computer science.
1) Information retrieval
The history of information retrieval can be traced back to the 1950s and 1960s. Its core goal is to find information relevant to a user's query within a large-scale document collection. Pioneers such as Hans Peter Luhn and Gerard Salton laid the foundations of the field; the research group led by Salton at Cornell University was an early powerhouse. Key concepts include:
Vector Space Model:
The model represents documents and queries as vectors in a high-dimensional space and judges relevance by computing the similarity between vectors (for example, cosine similarity). In other words, we define a rule that assigns coordinates to words, digitizing the text so that a computer can grasp the meaning of words and articles by calculating the distance and direction between coordinates.
This was an early attempt at semantic retrieval. Imagine asking a computer that does not understand human language at all to grasp the relationships between words. How should it "think"? The core problem is that computers only know numbers, not words: how do we convert the "meaning of a word" into numbers a computer can process? The answer is vectorization. The word sounds mathematical, but you can think of it as assigning "meaning coordinates". Just as every city on a map has latitude and longitude (e.g., Beijing: 116°E, 39°N) and can be uniquely located by the pair of numbers [116, 39], vectorization does something similar: we assign each word a "meaning coordinate", usually made up of hundreds or thousands of numbers.
Simple example (2D coordinates):
Suppose we describe words in two dimensions: [“biological”, “human relevance”]
- “Dog” may be represented as [9, 4] (biologicality score 9, human relevance score 4)
- “Cat” may be represented as [9, 3] (similar to “dog”, so the coordinates are close)
- “Stone” may be represented as [1, 0] (very low biologicality and low human relevance)
- “King” may be represented as [7, 9] (more biological, highly human-related)
- “Queen” may be represented as [7, 8.9] (very close to the coordinates of “King”)
The multi-dimensional "meaning map" formed by the "meaning coordinates" of all these words is the vector space. On this "map", words with similar meanings sit very close to each other.
OK, once we have a vector space, how exactly do we measure how "similar" two words are? Cosine similarity is one such measure. It does not care about the straight-line distance between two points, but about the consistency of their direction. Imagine you and a friend both set out from the center of the same square. If you both head due north, then even if you walk 100 meters and he walks 200 meters (different distances), your directions are identical, and your cosine similarity is at its maximum (equal to 1).
- If one faces due north and the other faces due east, at an angle of 90 degrees, then the cosine similarity is 0, which means “uncorrelated”.
- If one faces due north and the other faces due south, the directions are completely opposite, and the cosine similarity is at its minimum (equal to -1), meaning "opposite in meaning".
Therefore, in vector space, the vector directions of “cat” and “dog” will be very close, so their cosine similarity is high. The vector directions of “cat” and “spaceship” will be very different, and the similarity will be very low.
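To make this concrete, here is a minimal Python sketch using the toy 2-D coordinates above (invented purely for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

# Toy 2-D "meaning coordinates" from the example above: [biological-ness, human relevance]
words = {
    "dog":   np.array([9.0, 4.0]),
    "cat":   np.array([9.0, 3.0]),
    "king":  np.array([7.0, 9.0]),
    "queen": np.array([7.0, 8.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(words["cat"], words["dog"]))     # very close to 1: nearly the same direction
print(cosine_similarity(words["king"], words["queen"]))  # also very close to 1
print(cosine_similarity(words["cat"], words["king"]))    # noticeably lower: the directions diverge
```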
Term Frequency-Inverse Document Frequency (TF-IDF):
As a classic weighting technique, TF-IDF assesses how important a word is to a document within a corpus. The naive assumption would be that the word appearing most often is the most important, but you quickly run into a problem: words like "of", "is", and "and" appear most often of all, yet they carry no meaning.
TF-IDF is a very clever weighting algorithm designed to solve this problem. It consists of two parts:
a) TF – Term Frequency
Example: In a 1,000-word article on "artificial intelligence," the word "model" appears 30 times.
Its TF is 30 / 1000 = 0.03.
b) Inverse Document Frequency (IDF)
This is the essence of the TF-IDF algorithm: it measures how rare a word is. The more documents a word appears in, the less "unique" it is and the lower its IDF value. Words like "of" and "is" appear in almost every article, so their IDF is close to 0. Technical terms such as "gradient descent" and "neural network" appear only in AI-related articles; because they are "rare" across the corpus, their IDF is high.
Formula: TF-IDF = Term Frequency (TF) × Inverse Document Frequency (IDF)
The final score takes into account both “importance in this article” and “uniqueness in all articles”.
Back to the example:
- "Model": its TF is high in the AI article, and it is not a common word elsewhere (rare in economics or architecture articles), so its IDF is also high. The final TF-IDF score is high, and it is recognized as a keyword.
- "Of": its TF is extremely high in the AI article, but its IDF is close to 0. Multiplying the two gives a very low TF-IDF score, so it is treated as a meaningless stop word.
- "Mobile phone": it may not appear even once in this AI article (TF = 0), so no matter how high its IDF is, the final TF-IDF score is 0.
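A minimal Python sketch of the formula above (the three-document toy corpus is invented for illustration; real systems use smoothed, log-scaled variants and far larger corpora):

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF = term frequency in this document x inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)                       # importance inside this article
    docs_with_term = sum(1 for d in corpus if term in d)  # how many articles contain the term
    idf = math.log(len(corpus) / (1 + docs_with_term))    # rarity across all articles (+1 avoids /0)
    return tf * idf

corpus = [
    "the model learns a neural model of language".split(),   # the "AI article"
    "the economy grew and the market rallied".split(),
    "the building was designed by the architect".split(),
]
ai_article = corpus[0]
print(tf_idf("model", ai_article, corpus))  # high: frequent here, rare elsewhere
print(tf_idf("the",   ai_article, corpus))  # very low (slightly negative with this smoothing): it appears everywhere
```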
Probabilistic Models:
TF-IDF is already quite good, but is there room for improvement? In some cases it is still not "smart" enough.
Probabilistic models, and in particular algorithms like BM25, can be seen as a major upgrade of TF-IDF. Instead of simply multiplying two values, BM25 thinks in terms of probability: how likely is a user who searches for this word to be satisfied by this document?
BM25 is optimized in two main areas:
Term Frequency Saturation
TF-IDF has an assumption that 10 occurrences of a word are twice as important as 5 occurrences. But is this the reality?
For example: if the word "apple" appears once in an article, the article is probably related to apples. If it appears 10 times, its relevance is greatly strengthened. But going from 10 to 20 occurrences adds much less.
BM25 therefore introduces term frequency saturation: the effect of term frequency has an upper limit. It is like eating rice: the first bowl brings a lot of happiness, the second bowl still adds some, but by the tenth bowl the extra happiness is essentially zero. This matches our intuition about importance much better.
Document Length Normalization
TF-IDF is also unfair when comparing long and short documents.
For example: Search for “Einstein”
One is a 500-word introduction to Einstein.
The other is a 500,000-word History of Physics that also mentions Einstein. "Einstein" very likely appears more often in the massive History of Physics, giving it a higher TF-IDF score. But that clearly makes no sense, because the short introduction is the more relevant document. BM25 therefore "penalizes" long documents: it takes document length into account, and if a document is well above average length, it needs more keyword occurrences to earn the same score as a short one.
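Here is a sketch of the standard Okapi BM25 scoring function for a single query term, with the two ideas above (saturation via k1, length normalization via b) visible in the formula; the corpus statistics in the example call are made-up numbers:

```python
import math

def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, docs_with_term: int,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25 contribution of one term to one document's score."""
    # IDF part: rare terms are worth more (same intuition as TF-IDF's IDF).
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # TF part: grows with tf but saturates (controlled by k1), and is penalised
    # when the document is much longer than average (controlled by b).
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# "Einstein" mentioned 3 times in a 500-word intro vs 30 times in a 500,000-word book:
print(bm25_term_score(tf=3,  doc_len=500,     avg_doc_len=2_000, n_docs=10_000, docs_with_term=50))
print(bm25_term_score(tf=30, doc_len=500_000, avg_doc_len=2_000, n_docs=10_000, docs_with_term=50))
# The short, focused introduction scores several times higher than the huge book.
```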
2) Natural language generation
At the same time, natural language processing (NLP) and its sub-field natural language generation (NLG) were developing along their own track. The goal is to enable computers to understand and generate fluent, coherent human language. Milestones include:
Early exploration: the first attempts focused on rule-based machine translation and grammar- or template-based text generation; neural architectures such as recurrent neural networks (RNNs) only took over much later.
Statistical language models: in the 1980s and 1990s, with the rise of statistical methods, the N-gram model became mainstream. It generates text by computing the probability of a sequence of words, and it greatly advanced the development of NLG.
The development of NLG provides the necessary technical reserve for the “generation” link of RAG, allowing it to transform the retrieved information into natural, readable answers.
A historic "competition": early on, there was a methodological rivalry between the IR and AI/NLP communities. IR leaned toward statistics and quantitative analysis, while early AI focused more on logic and symbolic reasoning. This difference kept the two fields developing in parallel for a long time, with little intersection. Yet it is precisely this differentiated development that laid the groundwork for their complementary strengths to be fused later on.
1.2 Early Integration Attempts: The Open Domain Question Answering (ODQA) Era
Before RAG was officially proposed, the Open-Domain Question Answering (ODQA) system was the most successful attempt to integrate IR and NLG, and can be regarded as a “proto-RAG” (Proto-RAG).
Put another way, before the concept of RAG (Retrieval-Augmented Generation) was officially born, its "prototype" or "forerunner" existed for a long time: the Open-Domain QA system.
From “single subject exam” to “super library”
The earliest Q&A systems played "closed-domain question answering", like a single-subject open-book exam: you could only ask questions about one book or one narrow area (such as a company's product manual).
ODQA, on the other hand, throws the AI into a super library (such as the whole of Wikipedia) and asks it to answer factual questions on any topic. This was the first systematic attempt to tackle a challenge that requires massive external knowledge, and the difficulty grew exponentially. IBM's famous "Watson", built to compete on the quiz show Jeopardy!, was the pinnacle of this era.
The classic “two-step” workflow
These early “prototype RAG” systems used a very classic “two-step” pipeline architecture:
Step 1: Librarian (Retriever)
This module is like a librarian, after receiving your question, it first uses some traditional search techniques (such as TF-IDF or BM25, which can be understood as keyword matching) to quickly find dozens of potentially relevant paragraphs from a large number of documents.
Step 2: Reading Comprehension Specialist (Reader)
These identified passages are then given to an independent “reading comprehension specialist”. Its task is to read these passages intensively and extract answers accurately from them, or generate a one-sentence response based on this information.
This "retrieve the information first, then read and answer" pattern is the prototype of RAG's core idea.
Despite its success, this "two-step" model had several fundamental flaws, and it was precisely these flaws that led to the later, more advanced RAG architecture:
The field of view was too narrow: it could see the trees but not the forest. "Readers" at the time had limited capacity and could only process very short snippets of text, which forced the "librarian" to hand over only small fragments. But answering many complex questions requires reading an entire article, or even several, and this "peering through a straw" often lost key information.
Retrievers and readers were isolated from each other and could not cooperate. This was the fatal flaw: the retriever and reader were trained separately, each with its own objective. Even if the reader found that the material handed over by the retriever was full of garbage, it had no way to pass that feedback back, no way to tell the retriever "don't look there next time, look in another direction." The retriever could never learn from the quality of the final answer, so it could never get smarter.
The knowledge base had coverage gaps. Sometimes the answer simply was not in the knowledge base at all; or the user's question was ambiguous, causing the retriever to return a pile of irrelevant "noisy" documents that seriously interfered with the reader's work.
Poor transfer across domains. A retriever trained on the general library of Wikipedia is lost when you drop it into a specialized "medical" or "legal" collection, because it does not recognize the technical terminology or understand the structure of the domain. Adapting it to a new field meant retraining at great cost, which was very cumbersome.
It was precisely to solve this "narrow field of vision", "poor collaboration", and "poor adaptability" that a more integrated and intelligent RAG framework emerged.
1.3 Technology Catalysts: The Rise of Transformers and Dense Retrieval
Breakthroughs in two key technologies paved the way to addressing these challenges and ultimately gave birth to RAG.
Transformer Revolution: the proposal of the Transformer architecture in 2017 was a watershed. Its core self-attention mechanism enables a model to capture long-distance dependencies in text and produce contextualized embeddings. Models like BERT can deeply understand the exact meaning of a word in different contexts, far beyond simple keyword matching. Before that, computers understood words in a sentence largely in isolation, or could only see a few neighboring words; the Transformer lets a computer, like a human reader, read through the entire passage to work out the precise meaning of each word in its current context.
For example, the name "Cao Cao" reads differently in "Speak of Cao Cao and Cao Cao arrives" (the Chinese equivalent of "speak of the devil") than in "Cao Cao was a notorious traitor"; the surrounding words determine how it should be interpreted.
Dense vs. sparse retrieval: this progress directly drove an innovation in search technology, the shift from "sparse retrieval" to "dense retrieval".
Sparse Retrieval: represented by TF-IDF and BM25, it relies on exact keyword matching and represents documents as "sparse" vectors, high-dimensional but with most elements zero. It is like the index at the back of a book, or Ctrl+F on your computer: if you search for "car", it only looks for places where the word "car" literally appears. It is fast, because it does nothing but simple text matching, but it is also dumb: it does not understand that "car" and "vehicle" are synonyms, so a search for "car" will never find an article that only ever says "vehicle".
You can picture it as describing an article with a giant checklist of every possible word. Any single article uses only a tiny fraction of those words, so the vast majority of the list is 0, which is why it looks "sparse".
Dense Retrieval: harnessing the Transformer's power of understanding, it no longer matches text; it matches meaning. It converts your query (say, "a book on the decline of the Roman Empire") and every document (say, a book titled "Late Antiquity") into a meaning vector (a string of numbers) that represents its core idea, then mathematically finds which document's "meaning" is closest to your query. Its strength is that it understands concepts: even if an article never contains the words "decline of the Roman Empire", it can still be found if that is what it is about. This is true semantic search.
So, because we now have a "brain" like the Transformer that understands context deeply, search technology has been upgraded as well: instead of settling for rigid keyword lookup (sparse retrieval), we have evolved to smarter semantic search that understands your real intent (dense retrieval).
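A minimal contrast of the two ideas in code (sentence-transformers is one common embedding library; the model name below is an assumed choice, and any bi-encoder would do):

```python
from sentence_transformers import SentenceTransformer, util

docs = [
    "Regular maintenance keeps your vehicle running smoothly.",
    "The recipe calls for two cups of flour and one egg.",
]
query = "how do I take care of my car"

# Sparse view: count exact word overlap. Neither "car" nor "care" appears in the
# vehicle document, so keyword matching gives it no credit at all.
q_words = set(query.lower().split())
print([len(q_words & set(d.lower().replace(".", "").split())) for d in docs])

# Dense view: compare meanings in a shared vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_emb))  # the vehicle document should now score clearly higher
```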
The combination of the two finally creates the perfect conditions for RAG. By uniting their strengths in a single framework, it resolves the long-standing "rivalry" between IR and NLP and achieves an effect where one plus one is greater than two.
Large models like GPT-4 excel at understanding, reasoning, summarizing, and generating fluent language. Information retrieval (IR), the technology behind search engines, excels at finding the most relevant content in a huge pool of information. As the parameter counts of large language models grow exponentially, a fundamental problem appears: if you try to stuff all the world's knowledge into the model's "brain" (its parameters), the cost becomes unbearable, and once the model is trained, that knowledge is frozen and quickly goes out of date. RAG takes a different route. Instead of shoehorning all knowledge into the model, it stores the vast, frequently updated knowledge in an external, low-cost "library" (such as a vector database), using BM25 for proper nouns, code, and other queries that need exact matches, and modern semantic search for vague, conceptual questions. When a question needs answering, the model first consults this "library" for the latest and most relevant information, then uses its reasoning and summarization skills to produce the final answer. The large model no longer has to be a bookworm who memorizes everything; it becomes a talent who knows how to pull the latest information from an external library and think it through before answering.
Chapter 2: Formalization of RAG – A Paradigm Shift for Knowledge-Intensive NLP
In 2020, a paper published by Patrick Lewis and colleagues officially proposed and named the RAG framework, marking the birth of a new paradigm.
2.1 A Technical Dive into Lewis et al.'s Groundbreaking RAG Paper
The core contribution of this paper, published by researchers at Facebook AI (now Meta AI), is to propose a “general-purpose fine-tuning recipe”. The recipe aims to combine a pre-trained parametric memory with a non-parametric memory.
For example, suppose we are sitting an open-book exam. The exam question is an essay prompt: please analyze and summarize the latest research progress on quantum entanglement.
Before answering the question
When we first see the question, parametric memory is like our brain: knowledge already memorized and internalized as part of us (stored in the model's "parameters"). Its advantage is that it responds instantly and can be called on at any time; its downside is that the knowledge is limited and details may be misremembered or forgotten. This is the seq2seq model, itself a knowledgeable "brain".
The other is non-parametric memory, which is like the reference material you are allowed to bring into the exam room: external sources, such as a complete Quantum Physics textbook or a stack of the latest papers (in the RAG paper, this is Wikipedia). This knowledge is not stored in your head; it sits outside and can be "consulted" when needed. Its advantage is that it is vast, accurate, and can be updated at any time (like a new edition of a book); its downside is that looking things up takes time. This is the external knowledge base.
The question now is how to make good use of this reference material. The paper's answer is to treat the retrieved document as a latent variable.
Seeing the topic "quantum entanglement", we quickly flip through the reference material (non-parametric memory). Rather than finding a single article, we find 5 papers that all seem relevant. So which one is the "correct" source of the answer?
An old, rigid method (non-RAG) would first decide "Paper 3 is the most relevant!" and then base every argument entirely on paper 3. If that judgment is wrong, or if the answer actually needs to combine papers 1 and 3, the final grade will be low. That is a rigid "retrieve first, generate later" pipeline. RAG takes a smarter approach (latent-variable thinking): it does not immediately declare one paper "the only correct one". Instead, the brain performs a quick, fuzzy probability assessment over this "uncertain, hidden source of the correct answer", the so-called latent variable: based on what I know, paper 1 has perhaps a 70% chance of being useful, paper 2 only 10%, and paper 3 a 90% chance of being critical.
While answering the question
Next, every sentence we write is not based on looking at just one article, but is the result of "fusing" all the high-probability material in our mind.
This is the popular understanding of “marginalization”: weighted summation and fusion of multiple possibilities.
Instead of committing to a single source ("if I had only read paper 1 I might write this sentence; if I had only read paper 3 I might write from another angle"), the weighted fusion is the marginalization: the final text is the balanced result of having read all three papers.
Final sentence = (90% weight * by paper 3) + (70% weight * by paper 1) + (10% weight * by paper 2) + …
In this way, the answer synthesizes the essence of all relevant sources rather than sticking to one article. Even if the most important paper (paper 3) lacks a certain detail, you can add it from paper 1. This makes your answers more comprehensive and accurate.
After answering the questions (end of the exam)
The teacher grading the exam corresponds to model training. Under the old, rigid method, the teacher scores you in two separate steps: first assessing your ability to "pick references" (training the retriever separately), then assessing your "writing ability" given the material you picked (training the generator separately). The two steps are disconnected.
RAG's smarter approach (end-to-end training) is that the teacher looks only at the final answer. If your essay is brilliant (the final output is correct), you get a high mark, and that reward flows back through the whole answering process at once, crediting both your writing skill (the generator) and your ability to pick and blend material (the retriever).
Conversely, if the answer is wrong, the bad mark is also propagated through the whole process, prompting reflection on whether the mistake was in looking up the material or in writing it up. In the next round of training, the model adjusts both parts automatically and in sync.
The advantage of end-to-end training is that we no longer need to tell the model which passage to read and which to ignore. We simply give it the final correct answer, and it learns on its own how to find information and how to use it. The whole process is completed in one pass, greatly reducing the complexity and cost of training.
2.2 Architectural Innovation: Combining Parametric and Non-Parametric Memory
The RAG architecture of Lewis et al. consists of well-defined components that work together to enable the dynamic fusion of knowledge.
To accomplish a complex reporting task, we assembled an elite team. This team consists of two core members:
A librarian (retriever) and a lead writer (generator)
This librarian is responsible for quickly and accurately finding the most relevant information about the topic of the report from a huge library (knowledge base, such as Wikipedia).
His core skills:
DPR (Dense Passage Retrieval). "Dense" means he understands meaning rather than doing simple keyword matching: he knows that "the official residence of the President of the United States" and "the White House" refer to the same thing, even though the wording is completely different.
His workflow:
- Encoding: before receiving any task, the librarian does a huge amount of preparation.
- Document Encoder: he reads every book and every article in the library and writes a "content summary card" for each one. The card is special: it is written not in words but in a unique meaning code (i.e., a vector). Eventually he builds a catalogue of millions of these semantic code cards, which is the document vector index.
- Query Encoder: when a report task (a user question) comes in, he first has the query encoder convert the question into a meaning code in exactly the same format.
- Searching: now he holds one meaning-code card representing the question, plus a summary-code card for every book in the catalogue.
- Maximum Inner Product Search (MIPS): a technique that sounds complicated but is intuitive in principle. Think of it as a magnetic matching system: the question card and every book card are specially made magnets, and the MIPS system can instantly compute which book cards in the catalogue have the strongest magnetic attraction (the largest inner product) to the question card (a minimal sketch follows this list).
- In the end, he will find out the top K (such as Top-5) books with the strongest magnetic attraction and hand them over to the chief writer.
- Next comes the writer (a BART model), responsible for turning the user's original question and the material found by the librarian into a smooth, accurate report. He is a master of language with about 400 million "brain cells" (parameters), very good at understanding context (bidirectional) and at producing polished sentences word by word (auto-regressive). His job is to receive two things, the original question and the material found by the librarian, and write the final answer based on both.
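In code, the librarian's "magnetic matching" (MIPS) is just an inner product between the question vector and every document vector, then keeping the top K. A minimal NumPy sketch, with random vectors standing in for real DPR embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_index = rng.normal(size=(100_000, 128)).astype("float32")  # one row per "summary card"
query_vec = rng.normal(size=128).astype("float32")             # the question's "meaning code"

scores = doc_index @ query_vec      # inner product with every document at once
top_k = np.argsort(-scores)[:5]     # indices of the 5 strongest "magnets"
print(top_k, scores[top_k])
```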
Now, two core members of the team are in place. However, depending on the complexity of the reporting task, they have two different models of collaboration.
RAG-Sequence (Single-Source Focus)
This pattern suits tasks whose answer tends to be contained in a single, coherent document. For example, "Please tell us about the history of the Eiffel Tower."
Workflow:
- The librarian found the 5 most relevant articles.
- He takes out the first article and says to the writer: "Please write a complete draft report based only on this article." The writer completes Draft A.
- Then he takes back the first article, pulls out the second, and says: "Now forget what you just read and write an independent, complete draft based on this article." The writer completes Draft B.
- This process is repeated K times (say 5 times) to end up with 5 separate draft reports.
- Finally, the team evaluates and integrates the five drafts to arrive at a final report. When fused, drafts from more relevant articles (such as the one with the strongest magnetic force) will have higher weight.
Characteristics: the structure is simple and the logic is clear. Each draft focuses on a single source of information, which keeps the content coherent.
RAG-Token (Flexible Multi-Source Fusion Mode)
This mode is more powerful and suits complex answers that require integrating multiple sources of information. For example: compare and summarize the Allies' different strategic priorities in Europe and the Pacific during World War II.
Workflow:
The librarian finds the 5 most relevant articles and spreads them all out on the writer's desk at once.
The writer starts drafting, but instead of reading the articles one at a time, he writes word by word (token by token).
When writing the first word, he skims all 5 articles on the desk, weighs them, and, after synthesizing all the information, decides which word is the best opener.
When writing the second word, he skims all 5 articles again and decides what the second word should be, given the first word he has just written.
This process is repeated with each token generated. Throughout the writing process, writers always maintain a “global vision” of all relevant materials, dynamically and flexibly extracting the most needed information from them at every step.
In short, the ability to seamlessly blend fragments from different sources into one coherent answer makes this mode ideal for complex questions that demand comprehensive analysis.
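For readers who want the underlying math, the two collaboration modes correspond to two ways of marginalizing over the retrieved document z in the Lewis et al. paper, where p_eta is the retriever (the "librarian") and p_theta is the generator (the "writer"):

```latex
% RAG-Sequence: one retrieved document supports the whole answer; the drafts are then fused
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: a different retrieved document may support each generated token
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\,
  p_\theta(y_i \mid x, z, y_{1:i-1})
```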
OK, with this rough picture of how RAG works, we can see that before RAG, large language models were largely "black boxes". Their knowledge was baked into billions of opaque parameters, and their decision processes were hard to explain. RAG fundamentally changes this by externalizing the knowledge source, creating a system that is inherently more transparent and verifiable.
In principle, users can check the external documents referenced by the model to verify the authenticity of the content it generates. The paper itself emphasizes this, arguing that RAG provides greater controllability and interpretability, and redefines the standard of AI: a model must not only generate plausible answers, but also provide traceable evidence to support its answers. This is crucial for RAG applications in enterprise environments, where auditability, reliability, and trust are indispensable.
Chapter 3: Anatomy of Modern RAG Systems
Deconstruct today’s typical RAG systems with a detailed analysis of their common architectural components and workflows.
3.1 Core Pipeline: Step-by-Step Analysis
The workflow of a modern RAG system can be clearly divided into two main phases, the offline indexing phase and the online inference phase. This division reflects how the system preprocesses knowledge and responds in real time when it receives a user request.
Indexing – Offline phase:
This is the preprocessing phase of the knowledge base, with the goal of creating an efficient, searchable knowledge index. This stage is typically completed in the background once or periodically and includes the following steps:
Load: Loads raw data from various data sources (e.g., file systems, databases, APIs).
Split: Splits loaded long documents (e.g., PDFs, web pages) into smaller, semantically complete chunks of text (Chunks). This step is crucial because LLMs have a limited context window and are more accurate for retrieval on smaller, topic-focused blocks of text.
Embed: Use the Embedding Model to convert each block of text into a high-dimensional vector of numbers. This vector captures the semantic information of the text block.
Store: Stores the generated text block vectors and their corresponding original text content into a dedicated vector database (Vector Store) and indexes these vectors for quick similarity searches.
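A compressed sketch of the four offline steps. The embed() function is a stand-in for a real embedding model and the "vector store" is just an in-memory dictionary; all names here are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real system would call an embedding model (OpenAI, NV-Embed, etc.)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def split(document: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap (the simplest possible splitter)."""
    step = chunk_size - overlap
    return [document[i:i + chunk_size] for i in range(0, len(document), step)]

# Load -> Split -> Embed -> Store (here the "vector store" is just two parallel lists)
documents = ["...contents of a PDF...", "...contents of a web page..."]   # Load
chunks = [c for doc in documents for c in split(doc)]                     # Split
vectors = np.stack([embed(c) for c in chunks])                            # Embed
vector_store = {"vectors": vectors, "texts": chunks}                      # Store
```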
Retrieval and Generation – Online/Inference Phase:
This is the stage where the system executes in real-time when a user submits a query, with the goal of generating a knowledge-based, accurate answer:
- Retrieve: Receives user queries and converts them into query vectors using the same embedding model as in the indexing phase. Then, the query vector is used to conduct a similarity search in the vector database to find the top-K text blocks that are most related to the query semantics.
- Augment: These retrieved blocks of text are used as contextual information and combined with the user’s original query to form an “Augmented Prompt”
- Generate: Input this enhanced prompt into a large language model (LLM). Based on its own language capabilities and newly provided contextual information, LLMs generate a final, human-readable, and fact-based response.
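And a matching sketch of the online phase, reusing embed() and vector_store from the indexing sketch above; llm() is a placeholder for whatever chat/completion API the system actually calls:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real LLM call (any chat/completion API)."""
    return f"[an LLM would answer here, grounded in a prompt of {len(prompt)} characters]"

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query with the SAME model used at indexing time, then search the store."""
    q = embed(query)
    scores = vector_store["vectors"] @ q
    top = scores.argsort()[::-1][:k]
    return [vector_store["texts"][i] for i in top]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))                      # Retrieve
    prompt = (                                                  # Augment
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)                                          # Generate
```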
3.2 Component Deep Dive: The Building Blocks of RAG
A fully functional RAG system consists of several core components that work together.
Data Sources:
The capabilities of a RAG are largely dependent on the knowledge it has access to. Modern RAG systems can handle many types of data, including:
- Unstructured data: Such as PDF documents, Word files, web pages, plain text, etc., which are the most common data sources.
- Structured data: Such as tables and knowledge graphs in SQL databases. Through specific techniques, such as Text-to-SQL, RAG can query these structured data sources
- Semi-structured/multimodal data: Such as complex documents containing images, tables, and text, or even standalone image and video files.
Data Loading & Chunking:
This is the starting point of the RAG pipeline. Chunking is the process of cutting a long document into small pieces, and it matters for two reasons:
1) Adapt to LLM’s limited context window;
2) Improve the relevance of the search, as small chunks are usually more topically focused. However, chunking can also have drawbacks, as improper segmentation can undermine the semantic integrity of the original text, such as cutting a complete table or a continuous piece of argument, thus affecting the quality of subsequent steps.
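As a simple illustration of chunking that respects semantic boundaries, here is a paragraph-aware splitter (a plain heuristic sketch, not any specific library's splitter):

```python
def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole paragraphs into chunks so no chunk cuts a paragraph in half."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # current chunk is full: close it
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```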
Embedding Models: the embedding model is the "translator" of the RAG system, converting textual information into a mathematical form the machine can work with, the vectors discussed earlier. Its core job is to capture the semantic meaning of text. To ensure that queries and documents are compared in the same semantic space, the documents being indexed and the queries being encoded must use the same embedding model. Many mature embedding models are available, such as OpenAI's text-embedding series and NVIDIA's NV-Embed series.
Vector Stores/Databases: These are databases specifically designed for storing and efficiently querying high-dimensional vectors. Unlike traditional databases, their core capability is to perform Approximate Nearest Neighbor (ANN) searches
For example, here is the core challenge: a library holds millions, if not billions, of books. How do you quickly find the ones closest to your query?
If it’s an accurate lookup (Nearest Neighbor, NN), it’s the dumbest but most accurate way, take out a ruler, measure the distance between you and each book in the library, and then compare to find the nearest one. This is “precise lookup”. Its results are 100% accurate. But its problem is fatal: when the number of books reaches the level of millions or billions, it will take minutes or even hours to measure them one by one. This is completely unacceptable in RAG applications that require real-time Q&A.
The smarter alternative is approximate nearest neighbor (ANN) search. To achieve sub-second responses, the vector database adopts a cleverer strategy: think of drawing a guide map before looking for a book and searching area by area. Will you always find the single closest book in the whole library? Not necessarily! The truly closest book might sit just across the border of a neighboring area you never checked. But the book you do find is already "very, very close" (say, the second or third closest in the whole library), and that level of precision is entirely sufficient for a question-answering task. The essence of ANN is to sacrifice a little "absolute accuracy" in exchange for thousands of times more query speed. The word "approximate" is the point: it represents a trade-off between efficiency and precision.
So the end result is to be able to quickly find the vector that is most similar to the query vector on a large dataset. Popular vector databases include Pinecone, Milvus, Chroma, Weaviate, etc.
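A sketch of the exact-vs-approximate trade-off using FAISS, one widely used ANN library (index types and parameters below are typical illustrative choices, not tuned values):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")   # document vectors
xq = rng.normal(size=(1, d)).astype("float32")   # query vector

# Exact nearest neighbour: measures the distance to every vector (accurate, slow at scale)
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# Approximate nearest neighbour: partition the space into cells, only scan a few of them
nlist = 1024                                     # number of cells ("areas" on the guide map)
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8                                   # how many cells to visit per query
D_approx, I_approx = ivf.search(xq, 5)

print(I_exact, I_approx)  # usually overlap heavily, but ANN may miss the true closest vector
```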
3.3 Primary Mission: Mitigating Hallucinations and Strengthening the Factual Base
The RAG architecture was designed to solve several fundamental problems with standard LLMs.
Definition of the problem to be solved:
- Hallucinations: when LLMs lack relevant knowledge, they fabricate information that sounds plausible but is actually wrong or fictional. This is one of the most criticized problems of LLMs.
- Knowledge Cutoff: LLMs’ knowledge is static and limited to the point in time when their training data is cut-off. It knows nothing about the new events and discoveries that happened afterwards.
- Lack of domain knowledge: generic foundation models are not trained on internal, private data and therefore cannot answer questions tied to a specific organization or area of expertise.
RAG as a solution:
RAG addresses these issues through one core mechanism: factual grounding. It forces the LLM's generation to be based on externally retrievable, verifiable, and up-to-date facts, rather than relying solely on its internally solidified parametric memory. This mechanism brings several benefits:
- By providing accurate context, it significantly reduces the incidence of hallucinations.
- By connecting to a knowledge base that can be updated in real time, it overcomes the knowledge cut-off problem.
- With secure access to private databases, it lets LLMs leverage proprietary knowledge while protecting data privacy.
At the system level, RAG performs like an interlocking chain, and its ultimate strength depends on the weakest link.
A top-tier generator (LLM) also can’t compensate for the shortcomings of poor context provided by a bad retriever. Similarly, a perfect retriever cannot do anything if the knowledge base it relies on has fatal flaws in the initial chunking phase (e.g., splitting key information into two disconnected blocks of text).
These issues can occur when content is missing from the knowledge base, the retriever fails to find relevant documents, retrieved documents are ignored during integration, or ultimately the LLM fails to correctly extract answers from the context provided. This shows that building a high-performance RAG system is not just an “LLM optimization problem” but a complex “system engineering problem”.
Therefore, a team that wants to do RAG well needs developers who understand every link in the chain, from data cleaning and ingestion to final generation and output; prompt engineering alone is not enough, and data engineering and information retrieval expertise are also required.
Chapter 4: The evolution trajectory of the RAG paradigm
Since its inception in 2020, RAG technology has undergone rapid iteration and development to meet increasingly complex application needs. Its evolution path can be clearly divided into three main stages: Naive RAG, Advanced RAG, and Modular RAG. This evolution reflects the rapid maturation of the field from simple proofs of concept to complex production-grade systems.
RAG paradigm evolution comparison
4.1 Naive RAG: The Foundational "Retrieve-Read" Model
Naive RAG is the most basic form of RAG. It strictly follows a simple, linear "index → retrieve → generate" pipeline without any advanced optimization, and is essentially the conceptual model originally proposed by Lewis et al.
The process is straightforward: when a user enters a query, the system encodes the query as a vector, performs a similarity search in the vector database to retrieve the top-K most relevant text chunks, stitches those chunks together with the original query into an augmented prompt, and finally feeds the prompt into the LLM to generate the answer.
Its shortcomings are equally obvious once it is applied to more complex scenarios. Retrieval quality is low: the retrieved chunks may share only superficial keywords with the query while being semantically unrelated, introducing a lot of noise. The generation stage suffers in turn: when the retrieved information is noisy or inadequate, the answers can be repetitive, redundant, logically incoherent, or outright hallucinated.
4.2 Advanced RAG: A multi-pronged optimization approach
To put it simply, advanced RAG is based on the traditional “search first, answer later” model, adding some “preparation” and “processing” steps to make the results more reliable.
This is mainly divided into two major steps:
Optimize your search for “raw materials”
This step is to make our knowledge base and user questions better before searching.
Optimize the knowledge base (index optimization): instead of blindly cutting an article into fixed-size paragraphs, use intelligent segmentation (semantic chunking), which splits the text according to meaning so that each chunk is complete and coherent rather than broken off mid-sentence.
Label documents (metadata and hierarchical indexes): tag each chunk with author, date, chapter, and so on. At search time you can filter by tags first, or search a layer of "content summaries" to locate the relevant long document and only then drill down to specific passages, much like finding a chapter via a book's table of contents before reading the detail.
Optimize user questions (query conversion): Make the questions that users may be vague to be clearer and more suitable for machine search.
Help users ask better questions (query rewriting): use an AI model to rewrite a user's casual question into a more specific, well-formed one. For example, you ask "What are the disadvantages of RAG?" and the system rewrites it as "What are the main technical challenges and limitations of retrieval-augmented generation systems in practical applications?", which makes accurate answers easier to find.
Let the AI "guess" a perfect answer (Hypothetical Document Embeddings, HyDE): a clever trick. Instead of searching with your question directly, the system first asks the AI to imagine and generate the most perfect answer it can (a "fake" document) based on your question. It then uses this hypothetical perfect answer to find the most similar real documents in the knowledge base, because the hypothetical answer and the real answer will be very close in meaning.
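A sketch of the HyDE idea, reusing the llm(), embed(), and vector_store placeholders from the Chapter 3 sketches:

```python
def hyde_retrieve(question: str, k: int = 3) -> list[str]:
    """Search with the embedding of an imagined answer instead of the raw question."""
    # 1. Ask the LLM to imagine what a perfect answer might look like. Its details may be
    #    wrong, but its wording and concepts will resemble the real answer.
    fake_answer = llm(f"Write a short passage that answers this question:\n{question}")
    # 2. Embed the fake answer and use THAT vector to search the knowledge base.
    q = embed(fake_answer)
    scores = vector_store["vectors"] @ q
    top = scores.argsort()[::-1][:k]
    return [vector_store["texts"][i] for i in top]
```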
Post-retrieval strategy: Refine the context
These strategies occur after retrieval and before generation, with the goal of filtering and purifying the retrieved preliminary results to provide the LLM with the highest quality context.
Re-ranking: a two-stage filtering process. First, a fast but relatively crude retriever (such as vector search) recalls a large candidate set (e.g., the top 50) from the full document collection. Then a more powerful, more computationally expensive model (usually a cross-encoder) re-scores and re-sorts this small candidate set to find the truly most relevant documents. Because a cross-encoder processes the query and the document together, it can make much deeper relevance judgments and is far more accurate than a dual encoder.
Context Compression/Selection: actively compress and filter the retrieved content before feeding it to the LLM, for example by removing sentences or paragraphs irrelevant to the query or by summarizing multiple documents to strip out noise and redundancy. The benefits are twofold: it helps the LLM focus on the most critical evidence and avoid "information overload", and it keeps the number of input tokens under control so the prompt does not exceed the context window limit.
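A sketch of the two-stage retrieve-then-rerank pattern using sentence-transformers' CrossEncoder (the model name is an assumed example of a public MS MARCO cross-encoder):

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Stage 2: score each (query, document) pair jointly and keep only the best few."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model name
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Stage 1 (fast, rough): recall ~50 candidates with vector search, then refine:
# best_docs = rerank(query, retrieve(query, k=50), top_n=5)
```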
4.3 Modular RAG: Towards a composable, flexible, and scalable architecture
Modular RAG is not just a collection of techniques; it represents a fundamental paradigm shift in system design. It decomposes the originally linear RAG pipeline into multiple independent, pluggable, and independently optimizable functional modules, such as retrieval, reasoning, memory, and generation.
Core components and concepts:
- Search Module: no longer a single retriever, but a composite module that can integrate multiple search strategies (e.g., vector search, keyword search, knowledge-graph search). It can even include a "Query Router" that intelligently dispatches each query to the most appropriate retrieval method based on its type and intent.
- Reasoning Module: The module can perform more complex operations, such as breaking down complex problems into multiple sub-problems (Query Decomposition), and then performing iterative retrieval, that is, generating new queries based on the results of the first round of retrieval, conducting multiple rounds of search, and simulating the human research process.
- Memory Module: This module can integrate conversation history, enabling RAG to handle multiple rounds of conversations. More advanced implementations can even use the content generated by the LLM itself as a kind of “self-memory” to be used in subsequent generations, allowing for continuous learning.
- Fusion/Merging Module: when the system obtains multiple result sets through multi-query or multi-source retrieval, an intelligent module is needed to merge them. For example, RAG-Fusion uses rank-fusion algorithms to consolidate the results of multiple sub-queries and improve the robustness of the final retrieval, as sketched after this list.
- Feedback Loops: The modular architecture makes it easier to introduce feedback mechanisms. For example, you can use the user’s implicit feedback (e.g., clicks) or explicit feedback (e.g., scoring) to continuously optimize the performance of retrieval modules or build modules through reinforcement learning (e.g., RLHF).
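As an example of the fusion module, here is Reciprocal Rank Fusion (RRF), the rank-merging formula RAG-Fusion-style systems commonly use (k = 60 is the conventional constant; the document IDs are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document earns 1/(k + rank) from every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Ranked results of three sub-queries produced by query decomposition:
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_d"],
    ["doc_b", "doc_a"],
])
print(fused)  # doc_b comes first: it ranks well in every list
```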
To summarize so far: naive RAG is like a simple, monolithic Python script, enough for a feature demo. Advanced RAG is like adding specialized libraries and functions to that script to optimize performance. Modular RAG represents a leap toward a microservices-style design: each component (retrieval, reranking, generation) is treated as a separate, independently deployable, and scalable service, and the services communicate through well-defined APIs.
Chapter 5: Next-generation RAG architecture
5.1 Agent Adaptive RAG: The dawn of autonomous multi-step reasoning
The current mainstream RAG paradigm is still a linear pipeline that passively responds to a single user query. An important direction of evolution is the shift from this passive "assembly line" model to an active "Agentic RAG" model.
From pipeline to agent
In the agentic paradigm, the role of the LLM changes fundamentally. It is no longer just the end of the pipeline (the generator); it becomes an autonomous agent capable of planning, reasoning, and decision-making. The retrieval system, in turn, changes from a fixed processing step into a "tool" the agent can call on demand.
Think of the new generation of RAG systems as an AI that has been upgraded from a “junior employee” to a “senior expert”. This “expert” has three core professional competencies:
Iterative Reasoning and Retrieval
This changes the simple model of “one question, one search, one answer” in the past. Now, when AI faces a complex problem,It’s more like a strategy analyst doing in-depth research。
How it works: it first breaks the big problem down into several logical sub-problems, runs a first round of retrieval, and then, based on what it finds, dynamically generates new, more precise queries for a second and third round of exploration, closing in step by step.
Professional value: this ability to iterate lets it handle complex or ambiguous queries that a single search cannot cover, gradually converging on the most complete answer.
Dynamic Tool Use
Here, the Agent plays the role of an intelligent task dispatcher and has the right to make decisions independently.
Working mode: It can determine which tool to call in real time based on the specific nature of the problem. For example, it analyzes whether it should query an internal vector database, connect to a SQL database to extract structured data, or perform a web search to get the latest developments.
Professional value: this reflects the system's flexibility and ability to optimize resources. It is no longer confined to a single knowledge source; it can integrate and invoke the most appropriate tools to complete the task, greatly broadening its application scenarios and raising the ceiling on the problems it can solve.
Self-Correction and Reflection
This is equivalent to a set of “metacognition” and “quality control” mechanisms built into the system.
How it works: during retrieval and reasoning, it continuously evaluates the quality of the information it finds. If it judges the current information to be irrelevant or insufficient for a high-quality answer, it can actively "stop" and "reflect", then try a new search strategy or switch to a different tool.
Professional value: this greatly improves the system's robustness, preventing it from stubbornly marching down a wrong or inefficient path, and enables dynamic self-optimization and error correction.
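A highly simplified sketch of this agentic loop, reusing the retrieve() and llm() placeholders from earlier; a production system would add proper tool schemas, budgets, and stopping criteria:

```python
def agentic_rag(question: str, max_rounds: int = 3) -> str:
    """Plan -> retrieve -> reflect -> (maybe) retrieve again -> answer."""
    notes: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence = retrieve(query, k=3)     # tool call: could equally be SQL or a web search
        notes.extend(evidence)
        verdict = llm(
            "Is the evidence below sufficient to answer the question? "
            "Reply 'YES' or propose a better follow-up query.\n"
            f"Question: {question}\nEvidence:\n" + "\n".join(notes)
        )
        if verdict.strip().upper().startswith("YES"):   # self-check: stop when satisfied
            break
        query = verdict                                 # otherwise, reformulate and search again
    return llm(
        f"Answer the question using the notes.\nQuestion: {question}\nNotes:\n" + "\n".join(notes)
    )
```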
5.2 Multimodal and graph-enhanced RAG: Beyond the boundaries of knowledge in text
The RAG system will no longer be just a "text-processing expert". It is evolving: learning how to see, to hear, and to understand the complex relationships between things.
Multimodal RAG: giving RAG its "five senses"
The core of this direction is to let RAG break out of the text-only world and understand and correlate many types of data: images, audio, video, and more.
How it works: it relies on a technique called multimodal embedding. Think of such a model as a "universal translator" that converts the content of an image, the meaning of a paragraph, or even a stretch of audio into a shared "mathematical language" (that is, it maps them into a shared vector space). This makes different kinds of information directly comparable.
Professional value: This makes “cross-modal retrieval” possible.
For example, you can type "Show me all the X-rays related to 'bone fractures'", and the system can understand the request and find the relevant medical images directly. Going further, the system can act like a panel of experts, simultaneously analyzing a patient's X-rays (images), electronic medical records (structured data), and the relevant medical literature (text), and finally combining all of this information to generate a more comprehensive and reliable diagnostic recommendation.
Graph Augmented RAG (GraphRAG): Giving RAG a “Logical Reasoning Brain”
This approach introduces Knowledge Graphs (KGs), equipping RAG with a structured, logical "brain" that can supplement, or even replace, a traditional text library.
What is a knowledge graph? It is not a messy pile of documents but a giant network of "entity-relation-entity" links. For example, "Tom Hanks" is one entity, "Forrest Gump" is another, and "starred in" is the relation between them. All knowledge is connected in this clear, unambiguous way.
Compared with traditional text retrieval, GraphRAG has two core advantages:
Precision: facts in a knowledge graph are structured, as unambiguous as rows in a database, free of the vagueness common in free text, so retrieval results are extremely precise.
Multi-hop Reasoning: its most powerful capability. When answering a question requires chaining several facts together, GraphRAG can "hop" across this relational network to uncover deep, indirect connections.
For example, you ask: "Which director directed a film starring Tom Hanks that won the Oscar for Best Picture?"
Its reasoning route is that the system will start from the point of “Tom Hanks”, find all the movies he “starred in” (first jump), then filter out those that “won the Oscar for Best Picture” (second jump), and finally follow this line to find the “director” of the corresponding film (third jump). This reasoning ability to “connect scattered information points” is crucial for mining hidden relationships from massive data.
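A toy illustration of this three-hop traversal over a tiny hand-built set of triples (a real GraphRAG system would query a graph database; the mini-graph below exists only for the example):

```python
# (subject, relation, object) triples: a miniature knowledge graph
triples = [
    ("Tom Hanks", "starred_in", "Forrest Gump"),
    ("Tom Hanks", "starred_in", "Cast Away"),
    ("Forrest Gump", "won", "Oscar for Best Picture"),
    ("Forrest Gump", "directed_by", "Robert Zemeckis"),
    ("Cast Away", "directed_by", "Robert Zemeckis"),
]

def neighbours(entity: str, relation: str) -> list[str]:
    """Follow one type of edge outward from an entity."""
    return [o for s, r, o in triples if s == entity and r == relation]

films = neighbours("Tom Hanks", "starred_in")                                   # hop 1
winners = [f for f in films if "Oscar for Best Picture" in neighbours(f, "won")]  # hop 2
directors = [d for f in winners for d in neighbours(f, "directed_by")]            # hop 3
print(directors)  # ['Robert Zemeckis']
```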
Finally, a summary
First, we can agree on one thing: RAG (Retrieval-Augmented Generation) is no longer just a clever technical trick; it has become a core pillar of modern AI applications, especially enterprise AI. Its development clearly reveals an important shift in the field as a whole: we are moving from a single-minded cult of "bigger models" toward building "smarter, more efficient hybrid systems".
The core value of RAG is that, by "plugging in" knowledge, it makes large models more accurate, controllable, trustworthy, and cost-effective in real applications.
Some thoughts on future development:
Based on current developments, the following directions are not only the focus of RAG’s future research, but also strategic issues worthy of our in-depth consideration.
Thinking point 1: How to balance the “cost and benefit” of the RAG system?
With the advent of Agentic RAG, systems have become more powerful than ever, capable of complex reasoning and multi-step operations. But this immediately brings up a real trade-off:
What is the price of intelligence? A more complex inference chain necessarily means longer response latency and higher computational costs.
How do we choose? The future challenge is to design an adaptive control system that dynamically finds the best balance between "maximum intelligence" and "cost-effectiveness" according to the importance and complexity of each task. This is not just a technical question; it determines whether the technology can be commercialized at scale.
Thinking point 2: How to cross the gap between modalities?
The goal of multimodal RAG is to enable AI to comprehensively process text, images, data, and other information like humans. But the real challenge lies:
How to achieve deep integration? Current technology mostly "stitches together" information from different sources. The future breakthrough is getting models to truly understand and reason about the deep correlations between different modalities, producing a 1+1>2 "emergence" of knowledge.
For example, the system not only finds design drawings and sales reports, but also understands how a design change on the drawings led to negative feedback in the sales report.