In enterprise scenarios, an accurate knowledge base Q&A tool is crucial. This article provides an in-depth analysis of RAG (Retrieval-Augmented Generation) technology across the full pipeline: knowledge extraction, chunking, embedding, storage and indexing, retrieval, answer generation, and evaluation. It lays out the core selection and optimization ideas at each step, helping readers master a complete strategy for building a high-precision knowledge base assistant.
Low-code agent builders such as Dify and Coze wrap the various capabilities of RAG so that users can configure them with a few clicks in a GUI. This creates an illusion for many users: as if a few drag-and-drop operations were enough to set up an enterprise-grade knowledge base assistant.
In actual deployments, however, these neatly packaged upper layers have their limits, and the accuracy ceiling a low-code platform can reach is obvious: 50 or 60 points out of 100 is already considered very good. That score is simply unusable in enterprise scenarios. Would you accept an AI that gets corporate finance or administrative questions wrong? The journey from 50 points to 90 points is where RAG really shows its power.
From knowledge extraction, vectorization, and chunking to indexing, retrieval, and final generation, each step offers a range of optimization strategies to choose from, and different strategies suit different scenarios, data quality levels, and generation requirements.
Real RAG work, in other words, is a stack of intricate and detailed optimization strategies. It requires you to keep your knowledge system up to date and track the latest optimization directions, but also to understand your data and business scenarios, and to work backwards from the final generation requirements to decide how this series of strategies should be combined.
A previous article briefly introduced what RAG is and its core technologies. This article shares the core selection and optimization ideas at each step, as a RAG strategy map for everyone to exchange and learn from.
01 Extracting
Knowledge comes in several forms: structured (tables), semi-structured (web pages), and unstructured (PDF, Word, etc.). Unlike structured data in databases, knowledge bases often contain a large amount of unstructured data (video, audio, PDF, web pages, etc.), which greatly enriches the knowledge but also makes accurate extraction technically difficult.
Frameworks like Dify, LangChain, and LlamaIndex all ship with built-in extractors and also support integrating a rich set of other loaders. Taking Dify as an example, it supports both its self-developed file extraction solution and Unstructured’s extraction solution.
At present, some of the more common external extraction tools on the market are:
Among them, Unstructured is currently a popular general-purpose extraction tool. It supports a wide range of common document formats and is a good choice as a baseline general extractor. The real difficulty in the extraction stage lies elsewhere: extracting text from PDFs and images.
The difficulty with PDF is that its flexible, rich layout inherently contains many nested relationships. For example, an image inserted in the middle of a block of text may be the schematic diagram for the preceding paragraph. At the same time, the PDF format flattens the hierarchy of titles, subheadings, and first/second-level points, so you cannot easily read off the content structure the way you can from headings and body text in a web page.
Most officially circulated enterprise documents are in PDF format (so they cannot be casually tampered with or edited), so PDF files need special handling with dedicated extraction tools (PyMuPDF, MinerU, PyPDF). These tools are characterized by adapting specifically to the PDF format, acting like an element parser that can clearly distinguish what each element is: titles, body text, headers, footers, illustrations, and so on.
In addition, a large number of enterprise documents still exist as images, and accurate image recognition is especially important in the financial industry. Take a fund company as an example: it needs to review the material submitted for a newly appointed manager, which contains many photos of the manager’s diplomas, resume, and other credentials; it also needs to regularly collect banks’ electronic payment receipts to meet regulatory review requirements.
For example, the relatively small text of the taxpayer identification number in the image above is handled poorly when fed directly to a large model; we usually turn to OCR (Optical Character Recognition) instead.
Among the products we have used ourselves, the closed-source tool TextIn and Baidu’s open-source PaddleOCR (from the PaddlePaddle ecosystem) both offer a fairly controllable balance of accuracy and cost-effectiveness. You can also test them against your own business to weigh accuracy against cost.
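To give a concrete feel for the OCR step, here is a minimal sketch using PaddleOCR (the 2.x-style API is assumed); the file name and language setting are placeholders to adapt to your own receipts and certificates.

```python
# Minimal OCR sketch with PaddleOCR (pip install paddleocr paddlepaddle); 2.x-style API assumed.
from paddleocr import PaddleOCR

# lang="ch" loads a Chinese + English model; use_angle_cls handles rotated text
ocr = PaddleOCR(use_angle_cls=True, lang="ch")

# "bank_receipt.png" is a placeholder path for an electronic receipt image
result = ocr.ocr("bank_receipt.png", cls=True)

for box, (text, confidence) in result[0]:
    # Each detected line comes back with its bounding box, recognized text, and confidence
    print(f"{confidence:.2f}\t{text}")
```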
02 Chunking
Once knowledge extraction is complete, we have a large amount of knowledge (text, images, etc.) organized into a collection of documents. Before handing it to the large model for vectorization, however, it needs to be chunked. Why chunk instead of throwing an entire document at the model? Because a large model’s context window is limited.
For example, Qwen3’s context length is 32,768 tokens (roughly 50,000 Chinese characters), and that context has to hold not only the content blocks recalled from the knowledge base but also the user’s query, the prompt, and so on. Moreover, even though large models have been steadily extending their context lengths, a long context does not guarantee accuracy: it may recall distracting content blocks and make the model more prone to hallucination.
So, under a limited context length, chunking makes retrieval more precise, which reduces both hallucination and compute costs. What logic should chunking follow, then? Common chunking methods are as follows:
Of course, the actual chunking logic needs to be tuned step by step. For example, you can first split by the most conventional fixed number of characters, inspect the resulting chunks, and run recall tests to see the effect; if the results are poor, adjust the character count, add separators for recursive chunking, or even adjust the chunk contents by hand.
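As a starting point for the fixed-size / recursive approach described above, here is a minimal sketch using LangChain’s RecursiveCharacterTextSplitter; the chunk size, overlap, separators, and file name are illustrative assumptions to tune against your own recall tests.

```python
# Minimal recursive chunking sketch (pip install langchain-text-splitters).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # max characters per chunk; a starting value, not a recommendation
    chunk_overlap=50,    # overlap to preserve context across chunk boundaries
    separators=["\n\n", "\n", "。", ".", " ", ""],  # tried in order, coarse to fine
)

with open("extracted_doc.txt", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(len(chunks), chunks[0][:100])
```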
In addition, chunking exists to serve retrieval, so the chunking logic inevitably has to be index-oriented. Common chunking techniques that echo the later indexing stage include:
Parent-child text blocks: first generate larger parent blocks, then cut them into smaller child blocks, with parent and child mapped to each other by ID. At retrieval time, the child block is retrieved first, then its parent block is found via the ID, and both are passed to the large model together to produce richer, more accurate answers (a minimal sketch follows after this list).
Text block metadata: after chunking, attach the corresponding metadata (title, page number, creation time, file name, etc.) to each block so it can be used as a filter for more efficient retrieval (this feature is available in Dify since v1.1.0).
Summary + detail text blocks: similar to the parent-child idea, going from coarse to fine; generate summary information for the document and associate the summary with the detailed text blocks.
Recursive multi-level index: similar to parent-child and summary + detail, except the index tree is divided into more levels, moving top-down from coarse to fine information. This will be expanded on separately in the future.
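Below is a hand-rolled sketch of the parent-child idea (not any specific framework’s implementation); the parent and child sizes are arbitrary placeholders.

```python
# Parent-child chunking sketch: child blocks get embedded and retrieved,
# parent blocks provide richer context to the LLM at answer time.
import uuid

def build_parent_child_chunks(text, parent_size=2000, child_size=400):
    parent_store = {}   # parent_id -> parent text (fed to the LLM at answer time)
    child_chunks = []   # what actually gets embedded/indexed, each tagged with its parent_id
    for i in range(0, len(text), parent_size):
        parent_id = str(uuid.uuid4())
        parent_text = text[i:i + parent_size]
        parent_store[parent_id] = parent_text
        for j in range(0, len(parent_text), child_size):
            child_chunks.append({
                "id": str(uuid.uuid4()),
                "parent_id": parent_id,
                "text": parent_text[j:j + child_size],
            })
    return parent_store, child_chunks

# At retrieval time: match a child chunk, then look up parent_store[child["parent_id"]]
# and pass both to the large model for a richer, more accurate answer.
```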
03 Embedding
Once chunking is complete, the next step is to semantically understand and encode the knowledge in these chunks; this is also the first point in the whole RAG pipeline where a large model is needed. There are two common kinds of embeddings: sparse embeddings and dense embeddings, and when we talk about embeddings we usually mean dense ones. In short, dense embeddings capture semantic relationships better, while sparse embeddings are more efficient to compute and store.
1. Dense embeddings are a representation method that maps discrete symbols (such as words, sentences, users, items, etc.) into a low-dimensional continuous vector space. In this vector, most of the elements are non-zero real numbers, and each dimension implicitly expresses a certain semantic or characteristic.
2. Sparse embeddings are a representation method that maps data into a high-dimensional vector space where most dimensions are 0 and only a few are non-zero. The most common practice today is to combine the two for hybrid retrieval: dense embeddings capture semantic relationships, while sparse methods such as BM25 (which matches documents and queries based on term importance) handle keyword precision, achieving both semantic relevance and exact keyword matching.
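As an illustration of hybrid retrieval, here is a minimal sketch that fuses BM25 scores (via the rank_bm25 package) with dense cosine similarities (via sentence-transformers); the model name, sample documents, and the 50/50 weighting are assumptions to adapt, not recommendations.

```python
# Hybrid retrieval sketch: sparse BM25 + dense embeddings, fused with a weighted score.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["How to reimburse travel expenses", "Annual leave policy", "VPN setup guide"]
query = "expense reimbursement process"

# Sparse side: BM25 over whitespace-tokenized documents
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity of normalized embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)
dense_scores = (doc_emb @ q_emb.T).ravel()

def minmax(x):
    # Normalize each score range to [0, 1] before mixing
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * minmax(sparse_scores) + 0.5 * minmax(dense_scores)
print(docs[int(hybrid.argmax())])
```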
Common dense embedding models come from OpenAI, Jina, Cohere, Voyage, and Alibaba’s Qwen; you can check the ranking of the latest embedding models at https://huggingface.co/spaces/mteb/leaderboard.
As of this writing, the top multilingual embedding model is gemini-embedding-001, and second, third, and fourth places actually go to Alibaba’s Qwen embedding series, which is quite a surprise. The ranking is for reference only, though; you still need to evaluate against your actual task.
In addition, it is not only the generative model that can be fine-tuned (when we talk about large model fine-tuning we usually mean the generation-side model); embedding models support fine-tuning too, although few companies go this far. If you have highly specialized knowledge (e.g., medicine or law), specific formatting requirements, or cultural localization needs, fine-tuning the embedding model is the final step to consider.
Fine-tuning produces better text embeddings, pulling semantically similar text closer together in the embedding space.
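For reference, here is a minimal fine-tuning sketch with the sentence-transformers library (the classic model.fit API); the base model name and the query-passage pairs are placeholders for your own domain data.

```python
# Embedding fine-tuning sketch with sentence-transformers (classic fit API).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-small-zh-v1.5")  # assumed base model

# Each example pairs a query with a passage that should sit close to it in embedding space
train_examples = [
    InputExample(texts=["what is the subscription fee rate",
                        "The subscription fee rate is the fee charged when an investor buys a fund..."]),
    InputExample(texts=["how to obtain a bank e-receipt",
                        "Electronic receipts can be downloaded from online banking..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch passages act as negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="finetuned-embedding",
)
```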
04 Knowledge storage & indexing
After embedding, we have a large amount of embedded data, which of course cannot simply be stored in our usual relational/non-relational databases; it calls for a dedicated vector database that stores the embeddings as vectors.
The goal of storage is to make retrieval better and faster, so this section covers storage and indexing together. First, what vector databases are out there? The more popular options today are Milvus, Faiss, Chroma, Weaviate, Qdrant, Pinecone, and ElasticSearch, and of course major domestic vendors (such as Tencent) have built their own vector database ecosystems.
If you want something lightweight for testing or small projects, Faiss (Facebook’s open-source vector similarity search library) is a good choice; for enterprise-grade business use, consider Milvus; and if you already use ElasticSearch for search or storage, you can also consider its vector search features. In addition, Dify’s official default vector database is Weaviate, which suggests this component is also fine for enterprise commercial use.
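To show how little code a basic vector store needs, here is a minimal sketch with Chroma, chosen here only because it runs in-process; the collection name and sample texts are placeholders, and the same insert/query pattern applies to Milvus, Weaviate, and the others.

```python
# Minimal vector-store sketch with Chroma (pip install chromadb).
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data on disk
collection = client.create_collection("kb_chunks")

# Add chunks with ids and metadata; Chroma embeds the documents with its default embedding function
collection.add(
    ids=["c1", "c2"],
    documents=["Travel expenses are reimbursed within 30 days.",
               "Annual leave is 15 days per year."],
    metadatas=[{"source": "finance.pdf"}, {"source": "hr.pdf"}],
)

# Query by text; the store embeds the query and returns the nearest chunks
hits = collection.query(query_texts=["how long does reimbursement take"], n_results=1)
print(hits["documents"][0])
```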
(Source: Huang Jia’s “RAG Practical Class”)
When we store vectors in the database, we also need to build an index. Indexing is the process of organizing data efficiently, like the floor map you consult when entering a hospital, and it plays an important role in similarity retrieval by greatly reducing query time on large datasets. Common indexing methods are as follows:
(Source: Huang Jia’s “RAG Practical Class”)
Three core indexing ideas are worth highlighting:
FLAT (exact search): brute-force traversal of all the data; naturally only suitable for small datasets.
IVF_FLAT (inverted file index + exact search): partition the vectors into a number of clusters, compute the distance between the query vector and each cluster center, pick the n most similar clusters, and then search for the target vectors within those clusters. It is like looking for “cat”: first quickly locate the “animal” cluster.
HNSW (graph-based nearest neighbor search): one of the best-performing ANN (Approximate Nearest Neighbor) algorithms. It builds a multi-layer navigation graph (top, middle, bottom layers) whose density increases level by level, so a query can home in on the target quickly, like taking the subway. HNSW is currently the default indexing method for Weaviate in Dify. A Faiss sketch contrasting the three ideas follows below.
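Here is that sketch: a minimal Faiss example that builds all three index types on random vectors; the dimension, cluster count, and graph parameter are illustrative values, not tuned settings.

```python
# Minimal Faiss sketch contrasting FLAT, IVF_FLAT, and HNSW (pip install faiss-cpu).
import numpy as np
import faiss

d = 128
xb = np.random.random((10_000, d)).astype("float32")  # stand-in for embedded chunks
xq = np.random.random((5, d)).astype("float32")       # stand-in for query embeddings

# 1) FLAT: brute-force exact search
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# 2) IVF_FLAT: cluster first (nlist buckets), then search only the closest nprobe buckets
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 128)
ivf.train(xb)   # k-means over the data to build the clusters
ivf.add(xb)
ivf.nprobe = 8

# 3) HNSW: multi-layer navigable graph, no training step needed
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

for name, index in [("FLAT", flat), ("IVF_FLAT", ivf), ("HNSW", hnsw)]:
    distances, ids = index.search(xq, 4)  # top-4 nearest neighbors per query
    print(name, ids[0])
```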
05 Retrieval
After all this preparation, we finally arrive at retrieval, which is where the R in RAG (Retrieval-Augmented Generation) truly starts to play its part. Before retrieval, the common processing methods are as follows; among them, query structure transformation and query translation are commonly used pre-retrieval optimizations, while query routing is applied relatively rarely:
1. Query structure transformation
2. Query translation
3. Query routing
Logical routing: select the appropriate data source or retrieval method based on the user’s question.
Semantic routing: choose the appropriate prompt template based on the user’s question.
Beyond the pre-retrieval processing above, there are also some strategies that can be optimized after retrieval is complete:
The above gives some optimization ideas before and after retrieval; among them, query structure transformation, query translation, and re-ranking are basically must-have optimization points.
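As an example of the re-ranking step, here is a minimal sketch with a cross-encoder from sentence-transformers; the checkpoint name is a commonly used public reranker, and the query and candidate chunks are made up for illustration.

```python
# Re-ranking sketch: score (query, chunk) pairs with a cross-encoder, keep the best.
from sentence_transformers import CrossEncoder

query = "how do I get a bank e-receipt"
candidates = [
    "Electronic receipts can be downloaded from online banking within 90 days.",
    "Annual leave is 15 days per year.",
    "Reimbursement requires the original invoice.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Sort candidates by relevance before handing the top ones to the LLM
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```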
There are also some emerging directions, such as Self-RAG (letting the large model decide whether to retrieve, what to retrieve, whether it has retrieved enough, and whether it needs to retrieve again), which lets the model optimize the retrieval process on its own.
06 Answer Generation
Once the relevant knowledge chunks have been retrieved, the final step is to feed the user’s query and the retrieved text chunks to the large model, so it can use its own capabilities to answer the question. At this point the knowledge base RAG pipeline is complete. So, what else can we do to make generation better?
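Before any such optimization, the generation step itself is simply prompt assembly plus a model call; here is a minimal sketch assuming the OpenAI Python SDK and a placeholder model name, though any chat-style API or local model works the same way.

```python
# Generation sketch: stuff retrieved chunks and the user query into one prompt, call a chat model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved_chunks = [
    "Travel expenses are reimbursed within 30 days of submission.",
    "Receipts over 500 RMB require a manager's signature.",
]
question = "How long does expense reimbursement take?"

context = "\n\n".join(retrieved_chunks)
prompt = (
    "Answer the question using only the context below. "
    "If the context is not enough, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```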
I won’t expand too much here.
07 Evaluation
Evaluation determines, to a large extent, the value of the whole system. Suppose we are delivering a knowledge base Q&A product to a customer: which metrics we use to measure the effect becomes the key acceptance criterion.
In practice, though, the evaluation sets and evaluation models differ across customers and scenarios. Here are a few common evaluation metrics and frameworks on the market:
1. Retrieval evaluation (frameworks and their key metrics):
RAG TRIAD (RAG triad): context relevance, groundedness (faithfulness), answer relevance.
RAGAS: context precision, context recall, context entity recall, noise sensitivity.
DeepEval: contextual precision, contextual recall, contextual relevancy, etc.
2. Generation evaluation (frameworks and their key metrics):
RAGAS: answer relevance, faithfulness, multimodal faithfulness, multimodal relevance.
DeepEval: answer relevancy, faithfulness, etc.
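As an example of putting such metrics into practice, here is a minimal sketch with the RAGAS library (0.1-style column names assumed); the sample record is made up, and RAGAS calls an LLM behind the scenes, so an API key is assumed to be configured.

```python
# Evaluation sketch with RAGAS (pip install ragas datasets); sample record is illustrative only.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["How long does expense reimbursement take?"],
    "answer": ["Reimbursement is completed within 30 days of submission."],
    "contexts": [["Travel expenses are reimbursed within 30 days of submission."]],
    "ground_truth": ["Expenses are reimbursed within 30 days."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores for the evaluation set
```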