Embedding & Vector Retrieval

Preface

How does a computer understand that "cats and dogs are similar, but neither is similar to a car"? To humans this is common sense, but to a computer, "cat", "dog", and "car" are just three unrelated strings. Embedding technology is the key to solving this problem — it turns text into numerical vectors, allowing computers to understand semantic "closeness and distance."

What will you learn in this chapter?

After completing this chapter, you will gain:

Intuitive understanding: Grasp what Embedding is and why the vectors for "cat" and "dog" end up close together
Similarity calculation: Master core metrics like cosine similarity and Euclidean distance
Indexing principles: Understand how vector databases perform millisecond-level retrieval across millions of data points
Technology selection: Learn the characteristics and use cases of mainstream vector databases
End-to-end pipeline: Master the complete pipeline from text to vectors to retrieval

Chapter	Content	Core Concepts
Chapter 1	Embedding Concepts	Semantic space, vector representation
Chapter 2	Similarity Calculation	Cosine similarity, Euclidean distance
Chapter 3	Vector Indexing	Brute-force search vs. ANN
Chapter 4	Vector Databases	Pinecone, Milvus, Chroma
Chapter 5	End-to-End Pipeline	Text → Vector → Storage → Query

0. Panorama: The Bridge from Text to Numbers

In the world of natural language processing, there is a fundamental challenge: computers only understand numbers, not text.

Early approaches gave each word its own position in a binary vector (One-Hot encoding), for example "cat"=001, "dog"=010, "car"=100. This has a fatal flaw: every pair of distinct words is equally far apart. The distance from "cat" to "dog" is exactly the same as the distance from "cat" to "car", which clearly defies our intuition.
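
The equal-distance problem can be checked directly. The following is a minimal sketch, assuming a three-word vocabulary and Euclidean distance; the helper names `one_hot` and `euclidean` are illustrative, and the position of the 1 for each word is arbitrary.

```python
import math

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab = ["cat", "dog", "car"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_cat_dog = euclidean(vectors["cat"], vectors["dog"])
d_cat_car = euclidean(vectors["cat"], vectors["car"])

# Both distances are sqrt(2): one-hot vectors carry no notion of similarity.
print(d_cat_dog, d_cat_car)
```

Any two distinct one-hot vectors differ in exactly two positions, so the result is the same for every pair regardless of vocabulary size.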

The revolutionary insight of Embedding is this: it maps each word into a dense, low-dimensional vector space, where semantically similar words naturally cluster together. In this space, "cat" and "dog" are close neighbors, while "car" sits far away — the computer can finally "understand" semantics.

The Leap from One-Hot to Embedding

One-Hot: Dimension = vocabulary size (potentially tens of thousands), each vector has only a single 1 and all other entries are 0 — sparse and devoid of semantics
Embedding: Typically a few hundred to a few thousand dimensions (100–300 for Word2Vec, 768–3072 for recent models), dense and rich in semantic information
Key breakthrough: Word2Vec (2013) showed that useful word vectors can be learned from the idea that "a word's meaning can be defined by its context," ushering in the Embedding era

1. Embedding Concepts: Turning Text into Coordinates

The core idea of Embedding can be summarized in one sentence: use a set of numbers (a vector) to represent the meaning of a word or sentence.

Imagine a two-dimensional coordinate system. Place "cat" at coordinates (0.2, 0.7), "dog" at (0.3, 0.6), and "car" at (0.9, 0.1). You'll notice "cat" and "dog" have similar coordinates, while "car" is far away from both. This is the intuition behind Embedding — semantic similarity becomes spatial distance.
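
To make the intuition concrete, here is a small sketch that measures the distances between the three example points above. The coordinates are the ones from the text; `math.dist` is the standard-library Euclidean distance (Python 3.8+).

```python
import math

# 2D coordinates taken from the example in the text.
points = {
    "cat": (0.2, 0.7),
    "dog": (0.3, 0.6),
    "car": (0.9, 0.1),
}

d_cat_dog = math.dist(points["cat"], points["dog"])
d_cat_car = math.dist(points["cat"], points["car"])
d_dog_car = math.dist(points["dog"], points["car"])

print(f"cat-dog: {d_cat_dog:.3f}")  # 0.141
print(f"cat-car: {d_cat_car:.3f}")  # 0.922
print(f"dog-car: {d_dog_car:.3f}")  # 0.781
```

"cat" and "dog" are several times closer to each other than either is to "car", which is exactly the property One-Hot vectors lack.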

Word Embedding Space Visualization

Semantically similar words stay closer in vector space and form natural clusters

💡 Embedding models map text into high-dimensional vector spaces, often 768 to 1536 dimensions. This demo simplifies that into 2D to show the core idea: semantically similar words have shorter vector distances.

Three Key Properties of Embeddings

Semantic clustering: Words with similar meanings automatically group together (animals in one cluster, food in another, technology in another)
Analogical relationships: Vector arithmetic can express semantic relationships — the classic example: king − man + woman ≈ queen
Dimensional meaning: Semantic features (e.g., "is it an animal", "size", "sentiment") are encoded implicitly across the dimensions; in practice a single dimension rarely maps cleanly to one human-readable feature
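
The analogy property can be illustrated with a toy example. The vectors below are hand-made 2D values chosen purely for illustration (they do not come from any trained model), and `nearest_word` is a hypothetical helper; real embeddings only satisfy the analogy approximately.

```python
import math

# Hand-made toy vectors, NOT output of a real embedding model.
toy = {
    "king":  [0.9, 0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "queen": [0.9, 0.2],
    "apple": [0.0, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest_word(target, exclude):
    """Return the word whose vector points most nearly in the direction of `target`."""
    candidates = (w for w in toy if w not in exclude)
    return max(candidates, key=lambda w: cosine(target, toy[w]))

# king - man + woman, computed coordinate by coordinate.
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
result = nearest_word(target, exclude={"king", "man", "woman"})
print(result)  # queen
```

Excluding the three input words from the candidates is the usual convention for this kind of analogy test, since the result often lands very close to one of the inputs.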

Encoding Method	Dimensions	Semantic Information	Typical Application
One-Hot	Vocabulary size (~50,000)	None	Traditional NLP
Word2Vec	100–300	Word-level semantics	Word similarity, analogical reasoning
BERT Embedding	768	Contextual semantics	Sentence understanding, QA
OpenAI text-embedding-3	1536–3072	Deep semantics	RAG, semantic search

2. Similarity Calculation: How "Close" Are Two Vectors?

With vector representations in hand, the next question is naturally: how do we measure how similar two vectors are? It's like measuring how close two cities are on a map — you can measure the straight-line distance, or you can check whether they point in the same direction.

Vector Similarity Calculator

Interactive demo: drag vector endpoints to observe cosine similarity, Euclidean distance, and dot product in real time.

Cosine similarity: -1 (opposite) to 1 (same direction)
Euclidean distance: 0 (identical) to ∞ (far away)
Dot product: dot(A, B) = |A||B|cosθ

💡 Cosine similarity focuses only on direction and is useful for semantic text comparison; Euclidean distance considers both direction and magnitude and fits absolute-distance scenarios.

Two Core Metrics

Cosine Similarity: Measures whether two vectors point in the same direction. Range: [-1, 1]. 1 means identical direction, 0 means orthogonal (unrelated), -1 means completely opposite. The go-to choice for text semantic comparison because it's unaffected by vector magnitude.
Euclidean Distance: Measures the straight-line distance between the endpoints of two vectors. Range: [0, ∞). 0 means they coincide exactly; the larger the value, the less similar. Suitable for scenarios where "absolute magnitude" matters.
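
Both metrics fit in a few lines. The sketch below is a plain-Python illustration (the function names are illustrative; production code would normally use a vectorized library) and shows why cosine similarity is described as unaffected by magnitude.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the endpoints of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

same_direction = cosine_similarity(a, b)          # approximately 1.0
distance = euclidean_distance(a, b)               # sqrt(14), about 3.74
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0

print(same_direction, distance, orthogonal)
```

Here `b` is just `a` scaled by 2: cosine similarity treats the two as essentially identical, while Euclidean distance reports a clear gap.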

Metric	Formula Intuition	Range	Use Cases
Cosine Similarity	Compares direction, ignores magnitude	[-1, 1]	Text semantic search, recommendation systems
Euclidean Distance	Straight-line distance between endpoints	[0, ∞)	Image features, cluster analysis
Dot Product	Direction × Magnitude	(-∞, +∞)	Fast computation with normalized vectors
Manhattan Distance	Distance traveled along coordinate axes	[0, ∞)	High-dimensional sparse vectors
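
The "normalized vectors" entry for the dot product follows from a short identity. As a sketch, assume a and b are unit vectors (their lengths are both 1) with angle θ between them:

```latex
\begin{aligned}
a \cdot b &= \|a\|\,\|b\|\cos\theta = \cos\theta, \\
\|a - b\|^2 &= \|a\|^2 + \|b\|^2 - 2\,a \cdot b = 2 - 2\cos\theta.
\end{aligned}
```

So on normalized vectors, ranking by highest dot product, highest cosine similarity, or lowest Euclidean distance produces the same order, and the dot product is the cheapest of the three to compute.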

3. Vector Indexing: How to Search Millions of Vectors in Milliseconds?

Suppose you have 1 million documents, each converted into a 1536-dimensional vector. A user asks a question, and you need to find the 10 most similar ones. The most direct approach is to compute similarity one by one, but that means roughly 1.5 billion multiply-add operations for every single query, which becomes far too slow as data volume and query traffic grow.

This is the problem that vector indexing solves: trade space for time by building index structures in a preprocessing step, reducing per-query cost from O(n) to roughly O(log n) for graph-based indexes such as HNSW, at the price of a small loss in recall.
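
For reference, the O(n) baseline that indexes try to beat looks like the sketch below. It is a minimal brute-force search, assuming cosine similarity as the metric; `brute_force_top_k` is an illustrative name, and the four 2D vectors are toy data.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query: O(n) similarity computations."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (score, index), best first

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top = brute_force_top_k([1.0, 0.0], vectors, k=2)
top_ids = [index for _, index in top]
print(top_ids)  # [0, 1]
```

Every query touches every vector, so the result is exact, but the work grows linearly with the collection size.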

Vector Index Strategy Comparison

Compare brute-force search with approximate nearest neighbor search

Strategy	Time complexity	Accuracy	Use case
Brute force	`O(n)`	100%	Small datasets (<100K)
ANN (IVF)	`O(n/k)`	~95%	Large datasets (>100K)
HNSW	`O(log n)`	~98%	High-performance retrieval

Brute-Force Search vs. Approximate Nearest Neighbors (ANN)

Brute-force (Flat): Compare one by one. 100% accurate but slow. Suitable for small datasets (< 100k).
IVF (Inverted File Index): Partition the vector space into regions (clusters) first, then search only the nearest few regions during querying. Like organizing a library by subject — you only go to the relevant sections when looking for a book.
HNSW (Hierarchical Navigable Small World): Build a multi-layer graph structure, navigating from coarse to fine granularity layer by layer. Like first looking at a world map to locate a country, then a provincial map, and finally a street map.
PQ (Product Quantization): Compress high-dimensional vectors into short codes, sacrificing a small amount of precision for significant memory savings. Suitable for ultra-large-scale datasets.
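
The IVF idea can be sketched in a few functions. This is a simplified illustration under stated assumptions: the centroids are fixed by hand (a real index would learn them, typically with k-means), distances are Euclidean, and `nprobe` (the number of regions to scan) is a parameter name borrowed from common usage rather than from the text.

```python
import math

def nearest_centroids(vector, centroids, n):
    """Indices of the n centroids closest to `vector`."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))
    return order[:n]

def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        lists[nearest_centroids(vec, centroids, 1)[0]].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the vectors stored under the nprobe nearest centroids."""
    candidates = [i for c in nearest_centroids(query, centroids, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k], len(candidates)

# Hand-picked centroids and toy data forming two well-separated clusters.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0, 1), (1, 0), (1, 1), (9, 10), (10, 9), (10, 11)]
lists = build_ivf(vectors, centroids)

ids, scanned = ivf_search((0.9, 0.2), vectors, centroids, lists, k=2, nprobe=1)
print(ids, scanned)  # [1, 2] 3
```

Only 3 of the 6 vectors are compared against the query. The trade-off is that a true neighbor sitting just across a region boundary can be missed, which is why recall is below 100% and why raising `nprobe` improves accuracy at the cost of speed.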

Index Type	Build Speed	Query Speed	Recall	Memory Usage	Suitable Scale
Flat (Brute-force)	No build needed	Slow	100%	High	< 100k
IVF	Moderate	Fast	95%+	Medium	100k–10M
HNSW	Slow	Very fast	99%+	High	100k–10M
PQ	Moderate	Fast	90%+	Very low	> 10M
IVF-PQ	Moderate	Fast	92%+	Low	> 100M
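
The memory column is easier to appreciate with numbers. The arithmetic below is a rough sketch for 1 million 1536-dimensional vectors, assuming float32 storage for the flat index and a hypothetical PQ configuration of 96 sub-vectors with one-byte codes (256 centroids per subspace); actual settings vary by library and workload.

```python
num_vectors = 1_000_000
dims = 1536
bytes_per_float32 = 4

# Flat index: every dimension of every vector is stored as a float32.
flat_bytes = num_vectors * dims * bytes_per_float32

# Assumed PQ setup: 96 sub-vectors of 16 dimensions, each replaced by a 1-byte code.
num_subvectors = 96
centroids_per_subspace = 256
pq_code_bytes = num_vectors * num_subvectors * 1

# The codebooks (the centroids themselves) are stored once and shared by all vectors.
sub_dims = dims // num_subvectors
codebook_bytes = num_subvectors * centroids_per_subspace * sub_dims * bytes_per_float32

print(f"flat:      {flat_bytes / 1e9:.2f} GB")      # 6.14 GB
print(f"PQ codes:  {pq_code_bytes / 1e6:.0f} MB")   # 96 MB
print(f"codebooks: {codebook_bytes / 1e6:.2f} MB")  # 1.57 MB
print(f"ratio:     {flat_bytes // pq_code_bytes}x") # 64x
```

Under these assumptions the compressed codes take about 1/64 of the flat storage, which is what makes PQ attractive once collections no longer fit in memory.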

4. Vector Databases: Storage Engines Purpose-Built for Vectors

With vectors and indexing algorithms in hand, you need a place to store and manage them. Traditional databases (MySQL, PostgreSQL) excel at structured data but, out of the box, have no efficient way to run high-dimensional vector similarity search; extensions such as pgvector close part of that gap. Vector databases are designed specifically for this scenario.

Mainstream Vector Database Comparison

Pinecone: Managed cloud service (cloud native, serverless)
Milvus: Open-source distributed (open source, distributed, high performance)
Weaviate: Open-source AI native (open source, GraphQL, modular)
Chroma: Lightweight embedded (open source, lightweight, Python)
pgvector: PostgreSQL extension (SQL, PostgreSQL, extension)

Scenario recommendations

Fast prototype: Chroma / Pinecone
Enterprise deployment: Milvus / Weaviate
Existing PG database: pgvector
RAG app: Chroma / Weaviate

Core Capabilities of Vector Databases

Efficient storage: Storage formats optimized for high-dimensional floating-point vectors
ANN retrieval: Built-in support for multiple approximate nearest neighbor indexing algorithms (HNSW, IVF, etc.)
Metadata filtering: Support for filtering by tags, timestamps, and other criteria alongside vector search
Real-time updates: Support for dynamic insertion, deletion, and modification of vectors without rebuilding the entire index
Horizontal scaling: Distributed architecture supporting billion-scale vector collections
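
To show how metadata filtering and real-time updates combine with similarity search, here is a toy in-memory store. `TinyVectorStore` and its methods are hypothetical names for illustration, not the API of any product listed here, and the search is a linear scan rather than an ANN index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

class TinyVectorStore:
    """Illustrative store: id -> (vector, metadata), searched by linear scan."""

    def __init__(self):
        self._items = {}

    def upsert(self, item_id, vector, metadata):
        # Insert or overwrite in place; nothing needs rebuilding in this toy version.
        self._items[item_id] = (vector, metadata)

    def delete(self, item_id):
        self._items.pop(item_id, None)

    def query(self, vector, k, where=None):
        """Top-k ids by cosine similarity among items matching every `where` pair."""
        where = where or {}
        hits = [
            (cosine(vector, vec), item_id)
            for item_id, (vec, meta) in self._items.items()
            if all(meta.get(key) == value for key, value in where.items())
        ]
        hits.sort(reverse=True)
        return [item_id for _, item_id in hits[:k]]

store = TinyVectorStore()
store.upsert("a", [1.0, 0.0], {"lang": "en"})
store.upsert("b", [0.9, 0.1], {"lang": "fr"})
store.upsert("c", [0.0, 1.0], {"lang": "en"})

unfiltered = store.query([1.0, 0.0], k=2)                      # ['a', 'b']
english_only = store.query([1.0, 0.0], k=2, where={"lang": "en"})  # ['a', 'c']
print(unfiltered, english_only)
```

With the filter applied, "b" is excluded even though it is the second-closest vector. Real systems differ mainly in how they combine the filter with the ANN index so that filtering does not degrade recall or speed.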

Database	Type	Characteristics	Use Cases
Pinecone	Fully-managed cloud service	Zero ops, works out of the box	Rapid prototyping, small-to-medium production
Milvus	Open-source distributed	High performance, scalable	Large-scale production environments
Chroma	Open-source lightweight	Embeddable, clean API	Local development, small projects
Weaviate	Open-source cloud-native	Built-in vectorization, GraphQL	Scenarios requiring automatic vectorization
Qdrant	Open-source high-performance	Rust implementation, strong filtering	Scenarios requiring complex filtering
pgvector	PostgreSQL extension	Reuses existing PG infrastructure	Teams already using PostgreSQL

5. End-to-End Pipeline: The Complete Flow from Text to Retrieval

Now that we understand each component, let's wire them together and see how a complete vector retrieval system works.

The entire flow splits into two tracks: offline ingestion (turning documents into vectors and storing them) and online querying (turning questions into vectors and searching).

Embedding Generation Pipeline

Input text → Tokenize → Encode → Model inference → Pooling → Normalize
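
The five stages can be traced in a toy implementation. Everything below is a stand-in chosen for illustration: `VOCAB` and `TOKEN_VECTORS` are made-up values, the "model inference" step is a simple table lookup where a real model would run a neural network, and mean pooling is one common pooling choice among several.

```python
import math

# Hypothetical vocabulary and per-token vectors; a real model learns these.
VOCAB = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
TOKEN_VECTORS = [
    [0.0, 0.0, 0.0],
    [0.1, 0.3, 0.2],
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.6],
]

def tokenize(text):
    """Split text into lowercase tokens (real tokenizers use subword units)."""
    return text.lower().split()

def encode(tokens):
    """Map tokens to integer ids, falling back to the unknown-token id."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

def infer(token_ids):
    """Stand-in for model inference: one vector per token."""
    return [TOKEN_VECTORS[i] for i in token_ids]

def mean_pool(token_vectors):
    """Average the token vectors into a single fixed-size vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def l2_normalize(vector):
    """Scale to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

def embed(text):
    return l2_normalize(mean_pool(infer(encode(tokenize(text)))))

token_ids = encode(tokenize("The cat sat"))
embedding = embed("The cat sat")
print(token_ids)  # [1, 2, 3]
print(embedding)
```

Whatever the input length, the output has a fixed number of dimensions and unit length, which is what lets the downstream index compare any two texts.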

Offline Ingestion Pipeline

Document loading: Read raw text from various sources (PDFs, web pages, databases)
Text preprocessing: Clean, denoise, and normalize (strip HTML tags, special characters, etc.)
Text chunking: Split long text into appropriately sized chunks according to a strategy (200–500 tokens)
Vectorization: Call an embedding model (e.g., OpenAI text-embedding-3-small) to convert each chunk into a vector
Store in vector database: Write the vectors along with the original text and metadata into the database
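
The chunking step is the one most often written by hand. The sketch below uses a fixed-size sliding window, which is simpler than the recursive strategy recommended later in this chapter; the overlap between neighboring chunks is an added assumption (a common way to avoid cutting context at chunk borders), and the `w0`, `w1`, ... items stand in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Placeholder tokens standing in for a 1,000-token document.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)

print(len(chunks))               # 3
print([len(c) for c in chunks])  # [400, 400, 300]
print(chunks[1][0])              # w350
```

Each chunk would then be embedded and written to the database together with its source text and metadata, as described in the last two steps above.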

Online Query Pipeline

Receive query: User inputs a natural language question
Query vectorization: Convert the question into a vector using the same embedding model
Similarity retrieval: Search the vector database for the Top-K most similar document chunks
Post-processing: Rerank, deduplicate, filter by metadata
Return results: Return the most relevant document chunks to the caller (or pass them to an LLM to generate an answer)
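
Both tracks can be wired together in a small runnable sketch. The `embed` function here is a deliberately crude stand-in (normalized word counts over a vocabulary built from the chunks), used only so the example runs without an external model; it matches shared words, not meaning. The structural point it illustrates is that ingestion and querying must go through the same embedding function.

```python
import math

def tokenize(text):
    return [word.strip(".,?!").lower() for word in text.split()]

def build_vocab(texts):
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def embed(text, vocab):
    """Stand-in embedding: unit-length word-count vector (NOT a semantic model)."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

# Offline ingestion: embed each chunk and store it next to the original text.
chunks = [
    "Cats and dogs are popular pets.",
    "Cars need fuel or electricity to run.",
    "Dogs enjoy long walks.",
]
vocab = build_vocab(chunks)
index = [(embed(chunk, vocab), chunk) for chunk in chunks]

# Online query: embed the question the same way, then rank by dot product.
def search(question, k):
    q = embed(question, vocab)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, vec)), text) for vec, text in index),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

results = search("What do dogs enjoy?", k=1)
print(results)  # ['Dogs enjoy long walks.']
```

Swapping the stand-in `embed` for a real embedding model and the list `index` for a vector database gives the production version of the same flow.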

Stage	Key Decision	Recommended Approach
Embedding Model	Accuracy vs. cost vs. speed	OpenAI text-embedding-3-small (best value)
Chunking Strategy	Granularity vs. semantic integrity	Recursive chunking, 200–500 tokens
Vector Database	Scale vs. operational cost	Chroma for small projects, Pinecone/Milvus for production
Similarity Metric	Semantic vs. exact	Cosine similarity (go-to for text scenarios)
Top-K Value	Recall vs. noise	Retrieve 20 first, rerank and take Top 5
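
The "retrieve wide, then rerank" recommendation in the last row can be sketched as a two-stage function. The candidate list and its similarity scores are made-up values, and `keyword_overlap` is a hypothetical stand-in for a real reranker (often a cross-encoder model that scores each question and chunk pair).

```python
def retrieve_then_rerank(candidates, rerank_score, retrieve_k=20, final_k=5):
    """candidates: (similarity, text) pairs produced by the vector search stage."""
    # Stage 1: keep the retrieve_k best candidates by vector similarity.
    shortlist = sorted(candidates, key=lambda c: c[0], reverse=True)[:retrieve_k]
    # Stage 2: reorder the shortlist with a slower but more precise scorer.
    reranked = sorted(shortlist, key=lambda c: rerank_score(c[1]), reverse=True)
    return [text for _, text in reranked[:final_k]]

# Made-up vector search results for the question "dogs walks".
candidates = [
    (0.91, "Dogs are loyal pets."),
    (0.88, "Dogs enjoy long walks in the park."),
    (0.80, "Cats sleep most of the day."),
    (0.42, "Cars need fuel."),
]
query_terms = {"dogs", "walks"}

def keyword_overlap(text):
    """Stand-in reranker: number of query terms that appear in the chunk."""
    words = {word.strip(".,").lower() for word in text.split()}
    return len(query_terms & words)

top = retrieve_then_rerank(candidates, keyword_overlap, retrieve_k=3, final_k=2)
print(top)  # ['Dogs enjoy long walks in the park.', 'Dogs are loyal pets.']
```

The wide first stage protects recall, while the narrow second stage removes noise before the chunks are passed on.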

Summary

Embedding and vector retrieval form the bridge connecting "human language" and "machine understanding." They are also the foundational infrastructure for AI applications such as RAG, semantic search, and recommendation systems.

Key takeaways from this chapter:

The essence of Embedding: Mapping text into a high-dimensional vector space, turning semantic similarity into spatial distance
Similarity metrics: Cosine similarity focuses on direction (ideal for text), Euclidean distance focuses on absolute distance
Indexing is the performance key: HNSW and IVF reduce retrieval across millions of vectors to milliseconds
Vector database selection: Chroma/pgvector for small projects, Pinecone/Milvus for production
End-to-end thinking: From document loading to final retrieval, every stage's choices affect the final outcome
