RAG: Retrieval-Augmented Generation

Preface

Why does ChatGPT sometimes "make things up with confidence"? Large language models derive their knowledge from training data, but training data has a cutoff date and doesn't include your company's internal documents. RAG (Retrieval-Augmented Generation) is the core technology that solves this problem — letting AI "look up references" before answering.

What will you learn from this article?

After completing this chapter, you will gain:

Core concept understanding: Know what RAG is, why it's needed, and how it solves the "hallucination" problem of large models
Complete workflow understanding: Master the end-to-end process from document loading, chunking, vectorization to retrieval and generation
Technology selection ability: Understand the pros and cons of different chunking strategies and retrieval methods, and make choices based on scenarios
Architecture evolution perspective: Understand RAG's evolution from Naive to Advanced to Modular
Practical decision-making ability: Know when to use RAG and when to use fine-tuning

Chapter	Content	Core Concepts
Chapter 1	RAG Basic Workflow	Indexing, Retrieval, Generation stages
Chapter 2	Text Chunking Strategies	Fixed chunking, semantic chunking, recursive chunking
Chapter 3	Retrieval Techniques	Vector retrieval, keyword retrieval, hybrid retrieval
Chapter 4	Architecture Evolution	Naive RAG → Advanced RAG → Modular RAG
Chapter 5	RAG vs Fine-tuning	Comparison of applicable scenarios

0. Overview: Why Do Large Models Need to "Look Up References"?

Imagine you're a knowledgeable professor who has read countless books. But if someone asks you "what were yesterday's sales figures," you certainly can't answer — because that information isn't in the books you've read.

Large language models face exactly the same dilemma:

Knowledge has a cutoff date: GPT-4's training data has a cutoff point, so it doesn't know what happened after that
Lacks private knowledge: Your company's internal documents, product manuals, and customer data have never been seen by the model
Prone to hallucination: When the model is uncertain about an answer, it tends to "fabricate" a plausible-looking response

RAG's Core Idea

RAG's solution is very intuitive: Before letting the model answer, first help it find relevant reference materials. It's like an open-book exam — you don't need to memorize everything; you just need to know where to find it and how to look.

RAG = Retrieval + Augmented + Generation

1. RAG Basic Workflow: Indexing, Retrieval, Generation

RAG's workflow can be divided into two phases: offline indexing and online querying. The online phase covers both retrieval and generation, which gives three stages in total.

The offline phase is like a library's cataloging work — classifying, numbering, and shelving all books for easy future retrieval. The online phase is the process of a reader coming to the library to look up information — finding relevant books based on a question and then synthesizing the information to provide an answer.

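Example online query flow:

User question ("What is our company's annual leave policy?") → Semantic retrieval → Context assembly → LLM generation → Returned result

In the first step, the user asks the system a question in natural language. The question is converted into a vector representation, which is used for the subsequent semantic retrieval.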

Three Core Stages

Indexing Stage: Load, clean, and chunk original documents, then convert them into vectors through an embedding model and store them in a vector database. This preparation runs offline, ahead of any query, and only needs to be repeated when documents are added or changed.
Retrieval Stage: When a user asks a question, convert the question into a vector as well and search for the most similar document chunks in the vector database.
Generation Stage: Combine the retrieved document chunks with the user's question into a Prompt, and pass it to the large model to generate the final answer.

Stage	Input	Output	Key Technology
Indexing	Original documents	Vector database	Text chunking, embedding model
Retrieval	User question	Top-K document chunks	Vector similarity, reranking
Generation	Question + context	Final answer	Prompt engineering, LLM
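The three stages above can be sketched end to end in a few lines. This is only a minimal illustration under stated assumptions: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain in-memory list, the sample policy sentences are made up, and the final LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a term-count vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing stage: embed every chunk and keep (chunk, vector) pairs in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Retrieval stage: embed the question and return the top_k most similar chunks."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Generation stage, input side: combine retrieved chunks and the question into a prompt."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up sample chunks, for illustration only.
chunks = [
    "Employees receive 15 days of annual leave per year.",
    "Expense reports must be submitted within 30 days.",
    "Annual leave requests are submitted through the HR portal.",
]
index = build_index(chunks)
question = "What is the annual leave policy?"
top_chunks = retrieve(index, question, top_k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string would be sent to the LLM of your choice
```

In a real system, `embed` would call an embedding model and the index would live in a vector database, but the data flow between the three stages stays the same.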

2. Text Chunking: Fitting the Elephant into the Refrigerator

Text chunking is the most easily overlooked yet most impactful step in RAG. Why is chunking needed? Large models have limited context windows, so we can't stuff an entire book in. More importantly, chunking quality directly determines retrieval quality.

Imagine looking for a specific piece of knowledge in a book at the library. If the entire book is one "chunk," finding it is useless — you'd still have to flip through the whole book. But if it's chunked by chapter or even paragraph, you can precisely locate the content you need.

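Fixed-size chunking splits the text at a fixed number of characters and is the simplest, most direct approach. An overlap between adjacent chunks is usually configured so that context is not lost at the split boundaries, for example a chunk size of 80 characters with an overlap of 20 characters.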

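Strategy	Advantages	Disadvantages	Suitable Scenarios
Fixed size	Simple to implement, uniform chunk sizes	May cut in the middle of a sentence	Long text with little structure
By sentence	Preserves sentence integrity	Uneven chunk sizes	Natural text such as articles and reports
Semantic chunking	Coherent topics, semantically complete	High compute cost, requires an embedding model	Complex documents that mix multiple topics
Recursive chunking	Balances structure and size	More complex to implement	General-purpose, recommended default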
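Fixed-size chunking with overlap is short enough to show in full. The sketch below is one possible implementation, not a reference one: the function name `chunk_fixed` is hypothetical, and it counts characters rather than tokens. The defaults mirror the 80-character chunk size and 20-character overlap used as the example setting in this section.

```python
def chunk_fixed(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into chunks of chunk_size characters, each sharing `overlap` characters with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(200))  # 200 characters of dummy text
pieces = chunk_fixed(sample, chunk_size=80, overlap=20)
print(len(pieces), [len(p) for p in pieces])
```

Because the window advances by `chunk_size - overlap`, the last 20 characters of each chunk reappear at the start of the next one, which is what protects content that straddles a boundary.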

Choosing a Chunking Strategy

Fixed-size chunking: Split by character count or token count — simple but may break semantics
Recursive chunking: First split by paragraphs; if paragraphs are too long, split by sentences — preserves semantic integrity
Semantic chunking: Use embedding models to detect semantic boundaries, splitting where similarity drops sharply
Document structure chunking: Use structural information like Markdown headings and HTML tags for chunking
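To make the semantic chunking idea concrete, here is a rough sketch that starts a new chunk wherever the similarity between neighbouring sentences falls below a threshold. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words vector standing in for a real embedding model, the regex sentence splitter is naive, and the `threshold` value of 0.1 is arbitrary and would need tuning on real data.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: term counts (assumption: a real system would call an embedding model)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_semantic(text: str, threshold: float = 0.1) -> list[str]:
    """Group consecutive sentences; start a new chunk when adjacent-sentence similarity drops below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])    # similarity dropped: treat as a topic boundary
        else:
            groups[-1].append(current)  # still on the same topic
    return [" ".join(group) for group in groups]


text = (
    "Cats are small pets. Cats like to sleep. "
    "Python is a programming language. Python code is readable."
)
for chunk in chunk_semantic(text):
    print(repr(chunk))
```

The extra embedding call per sentence is where the higher compute cost of this strategy comes from.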

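There is no "best" chunking strategy, only the one most suitable for your data. A reasonable starting point is recursive chunking with a chunk size of 200-500 tokens and an overlap of 10-20%, adjusted afterwards based on the retrieval results you observe on your own documents.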
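Recursive chunking, the suggested default, can be sketched as follows. This is a simplified illustration rather than any particular library's splitter: the name `chunk_recursive` and the `LEVELS` table are hypothetical, sizes are measured in characters instead of tokens, and overlap is left out to keep the recursion easy to follow.

```python
import re

# Coarse-to-fine split levels: (regex used to split, string used to re-join pieces).
LEVELS = [
    (r"\n\s*\n", "\n\n"),      # paragraphs
    (r"(?<=[.!?])\s+", " "),   # sentences
    (r"\s+", " "),             # words
]


def chunk_recursive(text: str, max_chars: int = 400, level: int = 0) -> list[str]:
    """Split at the coarsest boundary first; recurse to finer boundaries only for oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level >= len(LEVELS):
        # No finer boundary left (a single very long word): fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    pattern, joiner = LEVELS[level]
    chunks, buffer = [], ""
    for piece in re.split(pattern, text):
        candidate = f"{buffer}{joiner}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= max_chars:
            buffer = piece
        else:
            buffer = ""
            chunks.extend(chunk_recursive(piece, max_chars, level + 1))
    if buffer:
        chunks.append(buffer)
    return chunks


document = (
    "Para one is short.\n\n"
    "Sentence number one is here. Sentence number two is here. "
    "Sentence number three is here."
)
for chunk in chunk_recursive(document, max_chars=60):
    print(len(chunk), repr(chunk))
```

Short paragraphs survive intact, and only the oversized second paragraph is broken down further at sentence boundaries.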

3. Retrieval Techniques: How to Find the Most Relevant Content?

After chunking is complete, the next key question is: When a user asks a question, how do you find the most relevant chunks from thousands of document segments?

This is like searching for books in a huge library. You can search by book title keywords (keyword retrieval), describe what you want and let the librarian help (semantic retrieval), or best of all, combine both approaches (hybrid retrieval).

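The vector retrieval pipeline has four steps: query encoding → vector search → reranking → Top-K selection.

Query encoding: The user's natural-language query is converted into a high-dimensional vector representation by an embedding model (such as text-embedding-ada-002). This vector captures the semantic information of the query.

Query text: "How do I apply for annual leave?"
Query vector (after encoding by the embedding model): [0.12, -0.45, 0.78, 0.33, -0.21, 0.56, 0.89, -0.14]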
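Once the query is a vector, vector search amounts to scoring it against every stored chunk vector and keeping the best matches. The sketch below shows a brute-force version with cosine similarity. The three-dimensional vectors and the chunk ids are made up for readability, and a real vector database would typically use an approximate nearest-neighbour index instead of scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk ids with tiny made-up embeddings.
doc_vecs = {
    "leave-policy": [1.0, 0.0, 1.0],
    "expense-policy": [0.0, 1.0, 0.0],
    "hr-portal-guide": [1.0, 1.0, 1.0],
}
query_vec = [1.0, 0.0, 1.0]
results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```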

Retrieval Method	Principle	Advantages	Disadvantages
Keyword Retrieval (BM25)	Based on term frequency and inverse document frequency	Exact matching, fast	Cannot understand semantics, fails with synonyms
Vector Retrieval	Based on cosine similarity of embedding vectors	Understands semantics, supports fuzzy matching	Less sensitive to proper nouns
Hybrid Retrieval	Fuses keyword and vector retrieval results	Balances precision and semantics	Requires weight tuning, higher complexity
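The table's keyword and hybrid rows can be illustrated together. The sketch below implements a compact BM25 scorer (using the common non-negative IDF variant) and then fuses its scores with vector-similarity scores through a weighted sum after min-max normalization. The fusion scheme, the `alpha` weight, the sample documents, and the vector scores are all assumptions chosen for illustration; other fusion methods exist and may suit your data better.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25 (term frequency, IDF, length normalization)."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = (sum(len(toks) for toks in tokenized) / n_docs) or 1.0
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so keyword and vector scores become comparable."""
    low, high = min(scores), max(scores)
    return [0.0] * len(scores) if high == low else [(s - low) / (high - low) for s in scores]


def hybrid_scores(keyword: list[float], vector: list[float], alpha: float = 0.5) -> list[float]:
    """Weighted fusion: alpha weights the vector score, (1 - alpha) the keyword score."""
    kw, vec = min_max(keyword), min_max(vector)
    return [alpha * v + (1 - alpha) * k for k, v in zip(kw, vec)]


docs = [
    "Error code E1234 means the disk is full.",
    "Storage problems can fill up the drive.",
    "The cafeteria opens at nine.",
]
keyword = bm25_scores("E1234 disk", docs)
vector = [0.60, 0.70, 0.10]  # made-up similarities, as if the embedding model slightly preferred doc 1
combined = hybrid_scores(keyword, vector, alpha=0.5)
best = max(range(len(docs)), key=lambda i: combined[i])
print(docs[best])
```

In this toy case the exact match on the identifier `E1234` pulls the first document ahead even though the assumed vector scores slightly favour the second, which is the kind of proper-noun weakness hybrid retrieval is meant to cover. The weight `alpha` is the knob the table refers to when it mentions weight tuning.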

Reranking

After retrieving candidate documents, a "reranking" step is usually needed. Initial retrieval focuses on recall (try not to miss anything), while reranking focuses on precision (put the most relevant at the top). Common reranking models include Cohere Rerank and BGE Reranker, which use cross-encoders to finely score query-document pairs.
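The two-stage shape described above (broad recall first, precise rescoring second) can be sketched with a pluggable scoring function. Note the assumption: `overlap_score` is a deliberately crude stand-in that only counts shared terms, whereas a real reranker such as a cross-encoder would score each query-document pair with a model. The function names and sample candidates are hypothetical.

```python
import re
from typing import Callable


def overlap_score(query: str, doc: str) -> float:
    """Stand-in pair scorer: fraction of query terms that appear in the document."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    d_terms = set(re.findall(r"\w+", doc.lower()))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Rescore every (query, candidate) pair and keep the top_n highest-scoring candidates."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Candidates as they might come back from a recall-oriented first stage.
candidates = [
    "Annual leave balances reset in January.",
    "To apply for annual leave, submit a request in the HR portal.",
    "Sick leave requires a doctor's note.",
]
reranked = rerank("how to apply for annual leave", candidates, top_n=2)
for doc, score in reranked:
    print(f"{score:.2f}  {doc}")
```

Keeping the scorer as a parameter means the toy function can later be swapped for a model-based one without touching the rest of the pipeline.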

4. Architecture Evolution: From Simple to Intelligent

RAG technology has gone through three generations of evolution in just two years, with each generation solving the pain points of the previous one.

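Naive RAG is the most basic RAG architecture, with a simple and direct flow: index → retrieve → generate. It is well suited to rapid prototyping, but its effectiveness is limited in complex scenarios.

Document loading → Text chunking → Vectorization → Retrieval → Generation

Architecture characteristics:

✅ Simple to implement, quick to get started
✅ Well suited to structured knowledge bases
⚠️ Retrieval quality depends on the chunking strategy
❌ Cannot handle complex queries

Architecture evolution roadmap: Naive RAG (2023) → Advanced RAG (2024) → Modular RAG (2025)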

Comparison of Three RAG Generations

Naive RAG (2023): The most basic "index → retrieve → generate" workflow. Simple to implement but limited effectiveness. Issues include: unstable retrieval quality, inability to handle complex queries, and easy introduction of noisy context.
Advanced RAG (2024): Built on top of Naive RAG with added query rewriting, hybrid retrieval, reranking, context compression, and other optimization steps, significantly improving retrieval precision and generation quality.
Modular RAG (2025): Decomposes RAG into pluggable modules, supporting routing decisions, adaptive retrieval, self-reflection, and other advanced capabilities. Can dynamically select the optimal processing workflow based on query type.
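The difference between the generations is easiest to see as pipelines assembled from the same building blocks. The sketch below is purely illustrative: every step is a placeholder that only records that it ran, the pipeline names are hypothetical, and the rule-based `route` function stands in for whatever routing logic (often an LLM or a classifier) a real Modular RAG system would use.

```python
from typing import Callable

State = dict
Step = Callable[[State], State]


def rewrite_query(state: State) -> State:
    # Placeholder: a real module might ask an LLM to rephrase or decompose the question.
    state["query"] = " ".join(state["question"].lower().split())
    state["trace"].append("rewrite_query")
    return state


def retrieve(state: State) -> State:
    query = state.get("query", state["question"])
    state["chunks"] = [f"<chunk matching '{query}'>"]  # placeholder retrieval result
    state["trace"].append("retrieve")
    return state


def rerank(state: State) -> State:
    state["chunks"] = sorted(state["chunks"])  # placeholder for a real reranking model
    state["trace"].append("rerank")
    return state


def generate(state: State) -> State:
    n_chunks = len(state.get("chunks", []))
    state["answer"] = f"<answer to '{state['question']}' using {n_chunks} chunk(s)>"
    state["trace"].append("generate")
    return state


# Each pipeline is just a list of pluggable modules.
PIPELINES: dict[str, list[Step]] = {
    "direct": [generate],                                     # skip retrieval entirely
    "naive": [retrieve, generate],                            # Naive RAG
    "advanced": [rewrite_query, retrieve, rerank, generate],  # Advanced RAG
}


def route(question: str) -> str:
    """Hypothetical rule-based router; picks a pipeline from the shape of the question."""
    words = question.lower().split()
    if words and words[0].strip("!,.") in {"hi", "hello", "thanks"}:
        return "direct"
    if len(words) > 8 or "and" in words:
        return "advanced"
    return "naive"


def run(question: str) -> tuple[str, State]:
    name = route(question)
    state: State = {"question": question, "trace": []}
    for step in PIPELINES[name]:
        state = step(state)
    return name, state


for q in ["hello", "What is the annual leave policy?",
          "Compare the annual leave policy and the sick leave policy"]:
    name, state = run(q)
    print(name, state["trace"])
```

Adding a new capability, such as context compression or a self-reflection check, then means writing one more step function and listing it in a pipeline, which is the pluggability the modular generation is named for.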

5. RAG vs Fine-tuning: Which Should You Choose?

When you want a large model to master domain-specific knowledge, there are usually two paths: RAG and fine-tuning. They are not mutually exclusive but complementary.

To use an analogy: Fine-tuning is like sending a student to training classes, internalizing knowledge into their brain; RAG is like giving a student reference books that they can consult during exams. Both approaches have their pros and cons; the key is your specific needs.

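Dimension	RAG	Fine-tuning
Knowledge updates	Real-time; changes take effect as soon as documents are modified	Requires retraining, with a long cycle
Implementation cost	Moderate (build a retrieval system; no GPU training needed)	High (requires GPU resources and labeled data)
Output style control	Depends on prompt engineering	Output style can be deeply customized
Hallucination control	Better (answers are grounded in traceable sources)	Limited (may still hallucinate)
Explainability	High (traceable sources)	Low (knowledge internalized in weights)
Inference latency	Requires an extra retrieval step	Generates directly, with no extra overhead
Private data security	Data stays in your own store and is not trained into the model weights	Data is absorbed into the model weights
Applicable scenarios	Knowledge base Q&A, document retrieval	Style transfer, specific task optimization

In one sentence: RAG is like giving the model a reference library that is updated in real time, which suits scenarios where knowledge changes frequently; fine-tuning is like having the model take a specialized course, which suits scenarios that need a specific style or domain depth. In real projects, the two are often combined.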

Practical Advice

In most scenarios, try RAG first. RAG's advantages include: no training required, real-time knowledge updates, and traceable answer sources. Only consider fine-tuning when you need to change the model's "behavioral patterns" (such as output format, language style, or reasoning approach). The strongest solution is often a RAG + fine-tuning combination.

Summary

RAG is currently one of the most practical technologies for putting large models into production. Its core value lies in: making model answers verifiable, knowledge updateable in real-time, and hallucination effectively controlled.

Key takeaways from this chapter:

The core problem RAG solves: Outdated model knowledge, lack of private data, and tendency to hallucinate
Three-stage workflow: Indexing (offline preparation) → Retrieval (online search) → Generation (comprehensive answer)
Chunking is foundational: Chunking quality directly determines retrieval quality; choosing the right chunking strategy is crucial
Retrieval is key: Hybrid retrieval + reranking is a strong default combination, pairing the exact matching of keyword search with the semantic understanding of vector search
Architecture is evolving: From Naive RAG to Modular RAG, systems are becoming increasingly intelligent and flexible
RAG and fine-tuning are complementary: Try RAG first in most scenarios; consider fine-tuning when you need to change model behavior
