
Search Engine Fundamentals

Introduction

You search for "red dress" on Taobao and get the most relevant results out of billions of products in about 0.1 seconds. How is this possible? Search engines are among the internet's most critical infrastructure. From Google to e-commerce site search, the core principles are the same: an inverted index plus relevance ranking.

What will you learn in this chapter?

After reading this chapter, you will gain:

  • Inverted index: Understand the core data structure of search engines
  • Tokenization technology: Learn about the challenges and common solutions for Chinese word segmentation
  • Relevance ranking: Master the basics of TF-IDF and BM25
  • Elasticsearch: Understand the architecture and use cases of the most popular search engine
  • Search optimization: Master practical search features like synonyms, spell correction, and highlighting

| Chapter | Content | Core Concepts |
| --- | --- | --- |
| Chapter 1 | Inverted index | Forward index vs inverted index |
| Chapter 2 | Tokenization and analysis | Chinese word segmentation, stop words, stemming |
| Chapter 3 | Relevance ranking | TF-IDF, BM25 |
| Chapter 4 | Elasticsearch | Distributed architecture, shards, replicas |
| Chapter 5 | Search optimization | Synonyms, spell correction, autocomplete |

The essence of search is an information retrieval problem: given a query, find the most relevant results from a massive collection of documents and return them sorted by relevance.

This process has two phases:

  • Indexing phase (offline): Pre-process all documents and build efficient lookup structures
  • Query phase (online): When a user enters keywords, quickly find matching documents and rank them

Why Not Use Database LIKE Queries?

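SELECT * FROM products WHERE name LIKE '%red dress%' might seem sufficient for search, but a pattern with a leading wildcard cannot use an ordinary B-tree index, so the database falls back to a full table scan, checking each row one by one. Once the data reaches millions of records, this query becomes unusably slow. An inverted index replaces the O(n) scan with a direct lookup of each query term, so the cost depends mainly on the number of matching documents rather than on the size of the whole collection.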


1. Inverted Index: The "Heart" of Search Engines

Traditional databases use forward indexes: from document ID to document content. Search engines use inverted indexes: from keywords to the list of documents containing them.

Inverted index example

Source documents:

  • Doc 1: Apple is a common fruit
  • Doc 2: The Apple company released a new phone
  • Doc 3: I like eating fruit and vegetables
  • Doc 4: This phone has a practical price
  • Doc 5: The fruit shop has apples and bananas

Inverted index table (terms are normalized, so "apples" is indexed as "apple"):

| Term | Document IDs |
| --- | --- |
| apple | 1, 2, 5 |
| fruit | 1, 3, 5 |
| phone | 2, 4 |
| company | 2 |
| release | 2 |
| like | 3 |
| vegetables | 3 |
| price | 4 |
| practical | 4 |
| banana | 5 |
| common | 1 |

| Index Type | Direction | Lookup Method | Use Case |
| --- | --- | --- | --- |
| Forward index | Document → Content | Know the ID, look up content | Database primary key queries |
| Inverted index | Keyword → Document list | Know the keyword, look up documents | Full-text search |

Inverted Index Construction Process

  1. Document collection: Gather all documents that need to be searchable
  2. Tokenization: Split documents into individual terms
  3. Build mapping: Record which documents each term appears in (along with position, frequency, etc.)
  4. Persist storage: Write the index to disk for fast lookup
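
The construction steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how a production engine stores its index: the sample sentences, the regex tokenizer, and the function names are assumptions made for the example, and the persistence step is omitted.

```python
import re
from collections import defaultdict

# Step 1: document collection (sample sentences keyed by document ID).
DOCS = {
    1: "Apple is a common fruit",
    2: "Apple released a new phone",
    3: "I like eating fruit and vegetables",
    4: "This phone has a practical price",
    5: "The fruit shop has apples and bananas",
}

def tokenize(text):
    """Step 2: lowercase and split on non-letters (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Step 3: map each term to {doc_id: [positions]} postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))

index = build_inverted_index(DOCS)
print(search(index, "fruit"))        # [1, 3, 5]
print(search(index, "apple phone"))  # [2]
```

Because this sketch does no stemming, "apples" in Doc 5 is indexed as a different term from "apple". Closing that gap is the job of the text analysis step described in the next section.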

2. Tokenization and Text Analysis

Tokenization is the first step in a search engine and the biggest challenge for Chinese search. English naturally separates words with spaces, but Chinese has no delimiters: "乒乓球拍卖了" could be segmented as "乒乓球/拍卖/了" (the ping-pong ball was auctioned) or as "乒乓/球拍/卖/了" (the ping-pong paddle was sold).

| Tokenization Method | Description | Example |
| --- | --- | --- |
| Standard tokenizer | Split by spaces and punctuation (English) | "hello world" → ["hello", "world"] |
| Chinese tokenizer | Segment based on dictionaries or models | "搜索引擎" (search engine) → ["搜索", "引擎"] |
| N-gram | Sliding window of fixed length | "搜索引擎" → ["搜索", "索引", "引擎"] (bigrams) |
| Custom dictionary | Add business-specific terms | "iPhone16ProMax" as a single term |
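
Two of the methods in the table are simple enough to sketch directly. The code below shows character N-grams and a greedy dictionary-based segmenter (forward maximum matching), which is one common dictionary approach and not necessarily what any particular tokenizer uses. The toy dictionary `VOCAB` and the function names are hypothetical.

```python
def char_ngrams(text, n=2):
    """Slide a window of n characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

VOCAB = {"搜索", "引擎", "乒乓球", "拍卖", "球拍"}  # hypothetical toy dictionary

print(char_ngrams("搜索引擎"))                   # ['搜索', '索引', '引擎']
print(forward_max_match("搜索引擎", VOCAB))      # ['搜索', '引擎']
print(forward_max_match("乒乓球拍卖了", VOCAB))  # ['乒乓球', '拍卖', '了']
```

The greedy matcher always commits to the first of the two possible readings of "乒乓球拍卖了", which is why real Chinese tokenizers combine dictionaries with statistical or model-based disambiguation.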

Text Analysis Pipeline

Tokenization is just one step in text analysis. The complete pipeline includes:

  1. Character filtering: Remove HTML tags, special characters
  2. Tokenization: Split text into tokens
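  3. Stop word filtering: Remove high-frequency words that carry little meaning on their own, like the Chinese particles "的" and "了" or the verb "是" (to be)
  4. Synonym expansion: Expand "手机" (mobile phone) to "手机、电话、移动电话" (mobile phone, telephone, cellular phone)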
  5. Stemming: Reduce "running" to "run" (English)
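
The five stages can be chained as plain functions. The following is a rough sketch for English text only: the stop word list, the synonym map, and the deliberately naive suffix-stripping stemmer are all assumptions made for illustration, and real analyzers use far more careful rules.

```python
import re

STOP_WORDS = {"the", "a", "is", "on", "of", "and"}   # hypothetical stop list
SYNONYMS = {"phone": ["mobile", "cellphone"]}        # hypothetical synonym map

def strip_html(text):
    """Stage 1: character filtering (drop HTML tags)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Stage 2: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Stage 3: drop high-frequency, low-information words."""
    return [t for t in tokens if t not in STOP_WORDS]

def expand_synonyms(tokens):
    """Stage 4: emit each token followed by its synonyms."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded

def stem(token):
    """Stage 5: naive suffix stripping, e.g. "running" -> "run"."""
    if token.endswith("ing") and len(token) > 5:
        root = token[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if root[-1] == root[-2] and root[-1] not in "aeiou":
            root = root[:-1]
        return root
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    tokens = tokenize(strip_html(text))
    tokens = remove_stop_words(tokens)
    tokens = expand_synonyms(tokens)
    return [stem(t) for t in tokens]

print(analyze("<b>Running</b> on the new phone"))
# ['run', 'new', 'phone', 'mobile', 'cellphone']
```

The same `analyze` function would normally be applied both when indexing documents and when parsing queries, so that both sides produce comparable terms.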

3. Relevance Ranking: Which Result Is Most "Relevant"?

Finding matching documents is only the first step. The more important step is ranking: placing the most relevant results at the top.

| Algorithm | Principle | Characteristics |
| --- | --- | --- |
| TF-IDF | Term Frequency (TF) × Inverse Document Frequency (IDF) | Classic algorithm, simple and effective |
| BM25 | Improved version of TF-IDF, adding document length normalization | Elasticsearch's default algorithm |
| Vector search | Convert documents and queries to vectors, compute cosine similarity | Supports semantic search |

Intuitive Understanding of TF-IDF

  • TF (Term Frequency): The more times a term appears in a document, the more likely the document is relevant to that term
  • IDF (Inverse Document Frequency): The fewer documents a term appears in, the higher its discriminative power
  • "的" (a Chinese particle) appears in nearly all documents (low IDF), so searching for "的" is meaningless
  • "Elasticsearch" appears in only a few documents (high IDF), so searching for it precisely locates relevant content
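
To make the intuition concrete, here is a small sketch that scores a toy corpus with both TF-IDF and BM25. The corpus is made-up data, and both functions share a smoothed IDF for simplicity, which is an assumption of this example (classic TF-IDF is usually written with log(N/df)). The parameter values k1 = 1.2 and b = 0.75 are the commonly used BM25 defaults.

```python
import math
from collections import Counter

# Pre-tokenized toy corpus (hypothetical data for illustration).
CORPUS = {
    "d1": ["apple", "fruit", "apple", "pie"],
    "d2": ["apple", "phone", "release"],
    "d3": ["fruit", "vegetable", "market", "fruit", "stall", "fresh"],
}

def idf(term, corpus):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    n_docs = len(corpus)
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def tf_idf(query, doc_id, corpus):
    """Sum of raw term frequency times IDF over the query terms."""
    counts = Counter(corpus[doc_id])
    return sum(counts[term] * idf(term, corpus) for term in query)

def bm25(query, doc_id, corpus, k1=1.2, b=0.75):
    """BM25: TF saturates as it grows, and long documents are penalized."""
    tokens = corpus[doc_id]
    counts = Counter(tokens)
    avg_len = sum(len(doc) for doc in corpus.values()) / len(corpus)
    length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
    score = 0.0
    for term in query:
        tf = counts[term]
        score += idf(term, corpus) * tf * (k1 + 1) / (tf + length_norm)
    return score

query = ["fruit"]
for doc_id in sorted(CORPUS, key=lambda d: bm25(query, d, CORPUS), reverse=True):
    print(doc_id, round(tf_idf(query, doc_id, CORPUS), 3),
          round(bm25(query, doc_id, CORPUS), 3))
```

In this corpus "d3" mentions "fruit" twice and "d1" once, so TF-IDF scores "d3" exactly twice as high. BM25 still ranks "d3" first but by a smaller margin, because the second occurrence adds less than the first and "d3" is longer than average.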

4. Elasticsearch

Elasticsearch is one of the most popular search engines. It is built on Apache Lucene and provides distributed full-text search through a RESTful API.

| Concept | Description |
| --- | --- |
| Index | Similar to a database "table," storing documents of the same type |
| Document | A single record, in JSON format |
| Shard | A partition, splitting an index across multiple nodes |
| Replica | A copy, providing high availability and read scaling |
| Mapping | Field type definitions, similar to a database schema |
| Analyzer | Text analyzer, defining tokenization rules |
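
Several of these concepts meet in the request body used to create an index. The sketch below only builds that body as a Python dictionary and prints it as JSON. The index name `products`, the field names, and the shard and replica counts are hypothetical choices for the example, and sending the request to a cluster is left out.

```python
import json

# Body for an index-creation request such as PUT /products.
# Index name, fields, and counts are hypothetical illustration values.
create_index_body = {
    "settings": {
        "number_of_shards": 3,     # primary shards the index is split into
        "number_of_replicas": 1,   # extra copies of each primary shard
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "standard"},  # full-text field
            "brand": {"type": "keyword"},                      # exact-match field
            "price": {"type": "float"},
        }
    },
}

# A document is a JSON object whose fields follow the mapping.
document = {"name": "Red summer dress", "brand": "ExampleBrand", "price": 199.0}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(document))
```

With these settings the index would hold three primary shards plus one replica of each, six shard copies in total, which the cluster distributes across its nodes.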

ES vs Database

Elasticsearch is not meant to replace databases; it works alongside them as a search layer. Typical architecture: data is written to the database → synced to ES → search requests go to ES → detail requests go to the database.
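
The data flow in this architecture can be illustrated with in-memory stand-ins. In the sketch below a plain dictionary plays the role of the database and a tiny inverted index plays the role of the search layer. All names are hypothetical, and the synchronous sync call is a simplification, since real systems often propagate changes asynchronously.

```python
from collections import defaultdict

database = {}                     # stand-in for the primary database
search_index = defaultdict(set)   # stand-in for the search layer

def sync_to_search(product_id, record):
    """Index only the searchable text; the database stays the source of truth."""
    for term in record["name"].lower().split():
        search_index[term].add(product_id)

def save_product(product_id, record):
    database[product_id] = record        # 1. write to the database
    sync_to_search(product_id, record)   # 2. sync to the search layer

def search(query):
    """3. Search requests hit the search layer and return matching IDs."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(search_index.get(t, set()) for t in terms))
    return sorted(hits)

def get_details(product_id):
    """4. Detail requests read the full record from the database."""
    return database[product_id]

save_product(1, {"name": "Red dress", "price": 199.0, "stock": 12})
save_product(2, {"name": "Blue dress", "price": 149.0, "stock": 3})

for product_id in search("red dress"):
    print(product_id, get_details(product_id))
```

The search layer holds only what is needed to find and rank results, while frequently changing fields such as stock stay in the database and are read at detail time.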


5. Search Optimization: Making Search "Smarter"

| Optimization Method | Description | Effect |
| --- | --- | --- |
| Synonyms | Searching "手机" (mobile) also finds "电话" (phone) | Improves recall |
| Spell correction | "iphoen" auto-corrected to "iphone" | Fault tolerance |
| Autocomplete | Typing "苹" suggests "苹果手机" (Apple phone) | Better UX |
| Highlighting | Matching words shown in red in search results | Visual clarity |
| Weight adjustment | Title match weight > content match weight | Improves precision |
| Filtering and aggregation | Filter by price range, brand | Narrow results |
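
Three of these features can be approximated with the Python standard library. This is a sketch of the ideas only, not how a search engine implements them: spell correction here uses `difflib.get_close_matches` against a hypothetical vocabulary, autocomplete is a plain prefix filter, and highlighting wraps matches in `<em>` tags.

```python
import difflib
import re

VOCABULARY = ["iphone", "ipad", "android", "dress"]   # hypothetical known terms
SUGGESTIONS = ["苹果手机", "香蕉", "红色连衣裙"]          # hypothetical popular queries

def correct_spelling(word, vocabulary):
    """Return the closest known term, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word

def autocomplete(prefix, suggestions, limit=5):
    """Return up to `limit` suggestions that start with the typed prefix."""
    return [s for s in sorted(suggestions) if s.startswith(prefix)][:limit]

def highlight(text, term):
    """Wrap each case-insensitive occurrence of the term in <em> tags."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"<em>{m.group(0)}</em>", text)

print(correct_spelling("iphoen", VOCABULARY))      # iphone
print(autocomplete("苹", SUGGESTIONS))              # ['苹果手机']
print(highlight("Red dress with red belt", "red"))
# <em>Red</em> dress with <em>red</em> belt
```

A production system would typically rank corrections and suggestions by query popularity as well as similarity. The sketch ignores that and returns the closest or first match.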

Summary

Search engines are core infrastructure for internet applications. Once you understand three core concepts (inverted indexes, tokenization, and relevance ranking), you have grasped the essence of search engines.

Key takeaways from this chapter:

  1. Inverted index: The reverse mapping from keywords to documents is the core data structure of search engines
  2. Tokenization is foundational: Chinese word segmentation is key to search quality; choosing the right tokenizer is essential
  3. BM25 ranking: Relevance scoring based on term frequency, document frequency, and document length; it is ES's default algorithm
  4. ES architecture: Shards + replicas enable distributed processing and high availability
  5. Search optimization: Synonyms, spell correction, and autocomplete make search smarter

Further Reading