RAG: Retrieval-Augmented Generation

Preface

Why does ChatGPT sometimes "make things up with confidence"? Large language models derive their knowledge from training data, but training data has a cutoff date and doesn't include your company's internal documents. RAG (Retrieval-Augmented Generation) addresses this problem by letting the model "look up references" before answering.

What will you learn from this article?

After completing this chapter, you will gain:

  • Core concept understanding: Know what RAG is, why it's needed, and how it solves the "hallucination" problem of large models
  • Complete workflow understanding: Master the end-to-end process from document loading, chunking, vectorization to retrieval and generation
  • Technology selection ability: Understand the pros and cons of different chunking strategies and retrieval methods, and make choices based on scenarios
  • Architecture evolution perspective: Understand RAG's evolution from Naive to Advanced to Modular
  • Practical decision-making ability: Know when to use RAG and when to use fine-tuning

| Chapter | Content | Core Concepts |
| --- | --- | --- |
| Chapter 1 | RAG Basic Workflow | Indexing, Retrieval, Generation stages |
| Chapter 2 | Text Chunking Strategies | Fixed chunking, semantic chunking, recursive chunking |
| Chapter 3 | Retrieval Techniques | Vector retrieval, keyword retrieval, hybrid retrieval |
| Chapter 4 | Architecture Evolution | Naive RAG → Advanced RAG → Modular RAG |
| Chapter 5 | RAG vs Fine-tuning | Comparison of applicable scenarios |

0. Overview: Why Do Large Models Need to "Look Up References"?

Imagine you're a knowledgeable professor who has read countless books. But if someone asks you "what were yesterday's sales figures," you certainly can't answer — because that information isn't in the books you've read.

Large language models face exactly the same dilemma:

  • Knowledge has a cutoff date: GPT-4's training data has a cutoff point, so it doesn't know what happened after that
  • Lacks private knowledge: Your company's internal documents, product manuals, and customer data have never been seen by the model
  • Prone to hallucination: When the model is uncertain about an answer, it tends to "fabricate" a plausible-looking response

RAG's Core Idea

RAG's solution is very intuitive: Before letting the model answer, first help it find relevant reference materials. It's like an open-book exam — you don't need to memorize everything; you just need to know where to find it and how to look.

RAG = Retrieval + Augmented + Generation


1. RAG Basic Workflow: Indexing, Retrieval, Generation

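RAG's workflow can be divided into two phases: offline indexing and online querying. The online phase covers both retrieval and generation, which is why the workflow is also described as three stages.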

The offline phase is like a library's cataloging work — classifying, numbering, and shelving all books for easy future retrieval. The online phase is the process of a reader coming to the library to look up information — finding relevant books based on a question and then synthesizing the information to provide an answer.

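Example query flow: User question ("What is our company's annual leave policy?") → Semantic retrieval → Context assembly → LLM generation → Return result

Step 1 of 5, user question: The user asks the system a natural-language question. The question is converted into a vector representation for the subsequent semantic retrieval.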

Three Core Stages

  1. Indexing Stage: Load, clean, and chunk original documents, then convert them into vectors through an embedding model and store them in a vector database. This preparation runs offline, ahead of any query, and only needs to be repeated when the documents change.
  2. Retrieval Stage: When a user asks a question, convert the question into a vector as well and search for the most similar document chunks in the vector database.
  3. Generation Stage: Combine the retrieved document chunks with the user's question into a Prompt, and pass it to the large model to generate the final answer.

| Stage | Input | Output | Key Technology |
| --- | --- | --- | --- |
| Indexing | Original documents | Vector database | Text chunking, embedding model |
| Retrieval | User question | Top-K document chunks | Vector similarity, reranking |
| Generation | Question + context | Final answer | Prompt engineering, LLM |
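
To make the three stages concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, the sample chunks are made up, and the generation stage stops at building the prompt rather than calling an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words count vector, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing stage: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Retrieval stage: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Generation stage, input side: combine retrieved chunks with the question."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up sample chunks, for illustration only.
chunks = [
    "Employees receive 15 days of annual leave per year.",
    "Expense reports must be submitted within 30 days.",
    "Annual leave requests require manager approval.",
]
index = build_index(chunks)
question = "What is the annual leave policy?"
top_chunks = retrieve(index, question, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The printed prompt is what would be passed to the large model. Swapping `embed` for a real embedding model and the list for a vector database changes the quality and scale, not the shape of the flow.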

2. Text Chunking: Fitting the Elephant into the Refrigerator

Text chunking is the most easily overlooked yet most impactful step in RAG. Why is chunking needed? Large models have limited context windows, so we can't stuff an entire book into a single prompt. More importantly, chunking quality directly determines retrieval quality.

Imagine looking for a specific piece of knowledge in a book at the library. If the entire book is one "chunk," retrieving it is of little use, because you would still have to flip through the whole book. But if it's chunked by chapter or even paragraph, you can precisely locate the content you need.

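Chunking demo, fixed-size strategy: Split the text by a fixed number of characters. This is the simplest and most direct chunking method. An overlap region is usually set so that context is not lost at chunk boundaries. Demo settings: chunk size 80 characters, overlap 20 characters.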
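
| Strategy | Advantages | Disadvantages | Applicable Scenarios |
| --- | --- | --- | --- |
| Fixed size | Simple to implement, uniform chunk size | May cut in the middle of a sentence | Long text with little structure |
| By sentence | Keeps sentences intact | Uneven chunk sizes | Natural text such as articles and reports |
| Semantic chunking | Topically coherent, semantically complete | High computational cost, requires an embedding model | Complex documents that mix multiple topics |
| Recursive chunking | Balances structure and size | More complex to implement | General-purpose; recommended default |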
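
Fixed-size chunking with overlap can be sketched in a few lines of Python. The function name `chunk_fixed` is made up for this example, and sizes are counted in characters (a token-based version would count tokens instead). The defaults of 80 and 20 are just example values.

```python
def chunk_fixed(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into chunks of `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(200))  # 200 characters of dummy text
pieces = chunk_fixed(sample, size=80, overlap=20)
for piece in pieces:
    print(len(piece), piece[:10])
```

Because the window advances by `size - overlap`, the last 20 characters of each chunk reappear at the start of the next one, which is what keeps a sentence cut at a boundary recoverable from at least one chunk.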

Choosing a Chunking Strategy

  • Fixed-size chunking: Split by character count or token count — simple but may break semantics
  • Recursive chunking: First split by paragraphs; if paragraphs are too long, split by sentences — preserves semantic integrity
  • Semantic chunking: Use embedding models to detect semantic boundaries, splitting where similarity drops sharply
  • Document structure chunking: Use structural information like Markdown headings and HTML tags for chunking

There is no "best" chunking strategy, only the one most suitable for your data. Generally, start with recursive chunking, chunk size 200-500 tokens, overlap 10-20%.
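
Since recursive chunking is the suggested starting point, here is a simplified sketch of the idea: try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. This is an illustration, not any particular library's implementation. The name `recursive_chunk`, the separator list, and the use of characters rather than tokens for `max_len` are all assumptions, and overlap is left out to keep the example short.

```python
def recursive_chunk(
    text: str,
    max_len: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split by the coarsest separator; recurse with finer ones for oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        part = part.strip()
        if not part:
            continue
        if len(part) > max_len:
            # Flush what we have, then split the oversized part more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_chunk(part, max_len, finer))
            continue
        # Greedily merge small neighbouring parts up to max_len.
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


doc = ("RAG has two phases. " * 5).strip() + "\n\n" + "Short paragraph."
result = recursive_chunk(doc, max_len=60)
for chunk in result:
    print(len(chunk), repr(chunk))
```

Note that the separator at a chunk boundary is dropped in this sketch (for example, the period at the end of a chunk split on ". "), which a more careful implementation would preserve.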


3. Retrieval Techniques: How to Find the Most Relevant Content?

After chunking is complete, the next key question is: When a user asks a question, how do you find the most relevant chunks from thousands of document segments?

This is like searching for books in a huge library. You can search by book title keywords (keyword retrieval), describe what you want and let the librarian help (semantic retrieval), or best of all, combine both approaches (hybrid retrieval).
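
Keyword retrieval is commonly implemented with BM25 scoring. The sketch below uses one widely used variant of the formula (the one with `+ 1` inside the logarithm so that IDF never goes negative). The tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample documents are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc_tokens in tokenized for term in set(doc_tokens))
    query_terms = set(tokenize(query))

    scores = []
    for doc_tokens in tokenized:
        tf = Counter(doc_tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is normalized by document length via b.
            denom = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "How to apply for annual leave",
    "Reimbursement of travel expenses",
    "Annual report on sales",
]
scores = bm25_scores("annual leave application", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this example "application" in the query does not match "apply" in the first document, which is exactly the exact-match limitation that semantic retrieval is meant to cover.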

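Example retrieval flow: 1. Query encoding → 2. Vector search → 3. Reranking → 4. Top-K selection

Step 1, query encoding: The user's natural-language query is converted by an embedding model (such as text-embedding-ada-002) into a high-dimensional vector that captures the query's semantic information. For example, the query text "How do I apply for annual leave?" is encoded into a query vector such as [0.12, -0.45, 0.78, 0.33, -0.21, 0.56, 0.89, -0.14].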
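
Once the query is a vector, vector search boils down to scoring every stored chunk vector against it and keeping the best few. Here is a brute-force sketch using cosine similarity. The 4-dimensional vectors and chunk IDs are hypothetical values chosen for illustration, and a real vector database would typically use an approximate nearest-neighbour index instead of scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the k best (id, score) pairs."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical vectors; real embedding models output hundreds or thousands of dimensions.
query_vec = [0.12, -0.45, 0.78, 0.33]
chunk_vecs = {
    "leave-policy": [0.10, -0.40, 0.80, 0.30],
    "expense-policy": [-0.70, 0.20, 0.10, -0.50],
    "leave-form": [0.20, -0.30, 0.60, 0.50],
}
results = top_k(query_vec, chunk_vecs, k=2)
for chunk_id, score in results:
    print(f"{score:.3f}  {chunk_id}")
```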

| Retrieval Method | Principle | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Keyword Retrieval (BM25) | Based on term frequency and inverse document frequency | Exact matching, fast | Cannot understand semantics, fails with synonyms |
| Vector Retrieval | Based on cosine similarity of embedding vectors | Understands semantics, supports fuzzy matching | Less sensitive to proper nouns |
| Hybrid Retrieval | Fuses keyword and vector retrieval results | Balances precision and semantics | Requires weight tuning, higher complexity |
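
One simple way to fuse the two result sets is a weighted sum of normalized scores, which is where the weight tuning mentioned in the table comes from. The sketch below assumes the keyword and vector scores have already been computed per chunk. The function names, the min-max normalization, and the `alpha` weight are illustrative choices, and other fusion schemes (such as rank-based fusion) are also common.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    keyword_scores: dict[str, float],
    vector_scores: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Weighted fusion: alpha weights the vector score, (1 - alpha) the keyword score."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    doc_ids = kw.keys() | vec.keys()
    fused = {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in doc_ids}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for three chunks from each retriever.
keyword_scores = {"a": 2.0, "b": 0.0, "c": 1.0}
vector_scores = {"a": 0.50, "b": 0.90, "c": 0.80}
ranking = hybrid_rank(keyword_scores, vector_scores, alpha=0.5)
print(ranking)
```

With `alpha=0.5`, chunk `c` comes out on top because it does reasonably well in both retrievers, while `a` and `b` each win only one. Moving `alpha` toward 0 or 1 shifts the ranking toward pure keyword or pure vector retrieval.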

Reranking

After retrieving candidate documents, a "reranking" step is usually needed. Initial retrieval focuses on recall (try not to miss anything), while reranking focuses on precision (put the most relevant at the top). Common reranking models include Cohere Rerank and BGE Reranker, which use cross-encoders to finely score query-document pairs.
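
The two-stage pattern can be sketched as follows: a cheap first pass collects candidates, then a more expensive scorer that looks at each query-document pair decides the final order. In this sketch `toy_pair_score` is only a stand-in (a Jaccard word overlap) for a real cross-encoder, which would run a model on the pair. All function names and sample documents are assumptions for illustration.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def recall_stage(query: str, docs: list[str], n: int) -> list[str]:
    """Cheap first pass: keep docs sharing at least one query term, ranked by overlap count."""
    q = tokens(query)
    hits = [(len(q & tokens(d)), d) for d in docs]
    hits = [h for h in hits if h[0] > 0]
    hits.sort(key=lambda h: h[0], reverse=True)
    return [d for _, d in hits[:n]]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder (assumption): Jaccard overlap of the pair."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query: str, candidates: list[str], score_pair: Callable[[str, str], float], k: int) -> list[str]:
    """Precision-focused second pass: rescore each (query, doc) pair and keep the top k."""
    return sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)[:k]


docs = [
    "Annual leave: how to apply, who approves, and how many days you get "
    "each year under the company policy for full-time staff",
    "Apply for annual leave",
    "Travel expense rules",
]
query = "apply for annual leave"
candidates = recall_stage(query, docs, n=10)
final = rerank(query, candidates, toy_pair_score, k=1)
print(final)
```

Both leave-related documents survive the recall stage with the same overlap count. The pair scorer then prefers the shorter, more focused one.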


4. Architecture Evolution: From Simple to Intelligent

RAG technology has gone through three generations of evolution in just two years, with each generation solving the pain points of the previous one.

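Naive RAG is the most basic RAG architecture, with a simple, direct flow: index → retrieve → generate. It suits rapid prototyping but has limited effectiveness in complex scenarios.

Naive RAG pipeline: Document loading → Text chunking → Vectorization → Retrieval → Generation

Architecture characteristics:

  • Simple to implement and quick to get started with
  • Well suited to structured knowledge bases
  • Retrieval quality depends on the chunking strategy
  • Cannot handle complex queries

Evolution roadmap: Naive RAG (2023) → Advanced RAG (2024) → Modular RAG (2025)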

Comparison of Three RAG Generations

  • Naive RAG (2023): The most basic "index → retrieve → generate" workflow. Simple to implement but limited effectiveness. Issues include: unstable retrieval quality, inability to handle complex queries, and easy introduction of noisy context.
  • Advanced RAG (2024): Built on top of Naive RAG with added query rewriting, hybrid retrieval, reranking, context compression, and other optimization steps, significantly improving retrieval precision and generation quality.
  • Modular RAG (2025): Decomposes RAG into pluggable modules, supporting routing decisions, adaptive retrieval, self-reflection, and other advanced capabilities. Can dynamically select the optimal processing workflow based on query type.
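
The routing idea behind Modular RAG can be illustrated with a small sketch: each processing workflow is a pluggable function, and a router picks one per query. Everything here is hypothetical. The keyword rules, pipeline names, and placeholder return strings exist only to show the structure, and real systems often use a classifier or an LLM to make the routing decision.

```python
from typing import Callable

Pipeline = Callable[[str], str]


def answer_directly(query: str) -> str:
    """Placeholder pipeline: skip retrieval entirely."""
    return f"[no retrieval] {query}"


def single_step_rag(query: str) -> str:
    """Placeholder pipeline: the basic retrieve-then-generate flow."""
    return f"[retrieve -> generate] {query}"


def multi_step_rag(query: str) -> str:
    """Placeholder pipeline: adds query rewriting and reranking."""
    return f"[rewrite -> retrieve -> rerank -> generate] {query}"


def route(query: str) -> Pipeline:
    """Hypothetical rule-based router that selects a pipeline by query type."""
    q = query.lower()
    if any(word in q for word in ("compare", " versus ", " vs ")):
        return multi_step_rag
    if any(word in q for word in ("policy", "manual", "document")):
        return single_step_rag
    return answer_directly


for q in ("Say hello", "What is the annual leave policy?", "Compare the 2023 and 2024 leave policy"):
    pipeline = route(q)
    print(pipeline(q))
```

Because each pipeline shares the same signature, modules can be added or swapped without touching the router's callers, which is the practical meaning of "pluggable" here.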

5. RAG vs Fine-tuning: Which Should You Choose?

When you want a large model to master domain-specific knowledge, there are usually two paths: RAG and fine-tuning. They are not mutually exclusive but complementary.

To use an analogy: Fine-tuning is like sending a student to training classes, internalizing knowledge into their brain; RAG is like giving a student reference books that they can consult during exams. Both approaches have their pros and cons; the key is your specific needs.

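
| Dimension | RAG (Retrieval-Augmented Generation) | Fine-tuning |
| --- | --- | --- |
| Knowledge update speed | Real-time; editing a document takes effect immediately | Requires retraining, with a long cycle |
| Implementation cost | Build a retrieval system; moderate cost | Requires GPU resources and labeled data |
| Response style control | Relies on prompt engineering | Output style can be deeply customized |
| Hallucination control | Grounded in evidence, with traceable sources | May still hallucinate |
| Inference latency | Needs an extra retrieval step | Generates directly, with no extra overhead |
| Private data security | Data stays local and does not enter the model | Data is absorbed into the model weights |

In one sentence: RAG is like giving the model a continuously updated reference library, which suits scenarios where knowledge changes frequently; fine-tuning is like having the model take a specialized course, which suits scenarios that need a specific style or domain depth. In real projects, the two are often combined.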

| Dimension | RAG | Fine-tuning |
| --- | --- | --- |
| Knowledge Updates | Real-time updates; just modify documents | Requires retraining |
| Cost | Low (no GPU training needed) | High (requires training resources) |
| Explainability | High (traceable sources) | Low (knowledge internalized in weights) |
| Applicable Scenarios | Knowledge base Q&A, document retrieval | Style transfer, specific task optimization |
| Hallucination Control | Better (answers are grounded in references) | Moderate (may still hallucinate) |

Practical Advice

In most scenarios, try RAG first. RAG's advantages include: no training required, real-time knowledge updates, and traceable answer sources. Only consider fine-tuning when you need to change the model's "behavioral patterns" (such as output format, language style, or reasoning approach). The strongest solution is often a RAG + fine-tuning combination.


Summary

RAG is currently one of the most practical technologies for putting large models into production. Its core value lies in making model answers verifiable, keeping knowledge updatable in near real time, and substantially reducing hallucination.

Key takeaways from this chapter:

  1. The core problem RAG solves: Outdated model knowledge, lack of private data, and tendency to hallucinate
  2. Three-stage workflow: Indexing (offline preparation) → Retrieval (online search) → Generation (comprehensive answer)
  3. Chunking is foundational: Chunking quality directly determines retrieval quality; choosing the right chunking strategy is crucial
  4. Retrieval is key: Hybrid retrieval + reranking is often the best-performing combination in practice
  5. Architecture is evolving: From Naive RAG to Modular RAG, systems are becoming increasingly intelligent and flexible
  6. RAG and fine-tuning are complementary: Try RAG first in most scenarios; consider fine-tuning when you need to change model behavior

Further Reading