Embedding & Vector Retrieval
Preface
How does a computer understand that "cats and dogs are similar, but neither is similar to a car"? To humans this is common sense, but to a computer, "cat", "dog", and "car" are just three unrelated strings. Embedding technology is the key to solving this problem — it turns text into numerical vectors, allowing computers to understand semantic "closeness and distance."
What will you learn in this article?
After completing this chapter, you will gain:
- Intuitive understanding: Grasp what Embedding is and why the vectors for "cat" and "dog" end up close together
- Similarity calculation: Master core metrics like cosine similarity and Euclidean distance
- Indexing principles: Understand how vector databases perform millisecond-level retrieval across millions of data points
- Technology selection: Learn the characteristics and use cases of mainstream vector databases
- End-to-end pipeline: Master the complete pipeline from text to vectors to retrieval
| Chapter | Content | Core Concepts |
|---|---|---|
| Chapter 1 | Embedding Concepts | Semantic space, vector representation |
| Chapter 2 | Similarity Calculation | Cosine similarity, Euclidean distance |
| Chapter 3 | Vector Indexing | Brute-force search vs. ANN |
| Chapter 4 | Vector Databases | Pinecone, Milvus, Chroma |
| Chapter 5 | End-to-End Pipeline | Text → Vector → Storage → Query |
0. Panorama: The Bridge from Text to Numbers
In the world of natural language processing, there is a fundamental challenge: computers only understand numbers, not text.
Early approaches assigned each word a numeric ID (One-Hot encoding), for example "cat"=001, "dog"=010, "car"=100. But this had a fatal flaw: all words are equally distant from each other. The distance from "cat" to "dog" is exactly the same as the distance from "cat" to "car" — which clearly defies our intuition.
The revolutionary insight of Embedding is this: it maps each word into a dense, low-dimensional vector space, where semantically similar words naturally cluster together. In this space, "cat" and "dog" are close neighbors, while "car" sits far away — the computer can finally "understand" semantics.
The Leap from One-Hot to Embedding
- One-Hot: Dimension = vocabulary size (potentially tens of thousands), each vector has only a single 1 and all other entries are 0 — sparse and devoid of semantics
- Embedding: Dimensions typically 768–1536, every number carries meaning — dense and rich in semantic information
- Key breakthrough: Word2Vec (2013) proved that "a word's meaning can be defined by its context," ushering in the Embedding era
1. Embedding Concepts: Turning Text into Coordinates
The core idea of Embedding can be summarized in one sentence: use a set of numbers (a vector) to represent the meaning of a word or sentence.
Imagine a two-dimensional coordinate system. Place "cat" at coordinates (0.2, 0.7), "dog" at (0.3, 0.6), and "car" at (0.9, 0.1). You'll notice "cat" and "dog" have similar coordinates, while "car" is far away from both. This is the intuition behind Embedding — semantic similarity becomes spatial distance.
Word Embedding Space Visualization
Semantically similar words stay closer in vector space and form natural clusters
💡 Embedding models map text into high-dimensional vector spaces, often 768 to 1536 dimensions. This demo simplifies that into 2D to show the core idea: semantically similar words have shorter vector distances。
Three Key Properties of Embeddings
- Semantic clustering: Words with similar meanings automatically group together (animals in one cluster, food in another, technology in another)
- Analogical relationships: Vector arithmetic can express semantic relationships — the classic example: king − man + woman ≈ queen
- Dimensional meaning: Each dimension implicitly encodes some semantic feature (e.g., "is it an animal", "size", "sentiment", etc.)
| Encoding Method | Dimensions | Semantic Information | Typical Application |
|---|---|---|---|
| One-Hot | Vocabulary size (~50,000) | None | Traditional NLP |
| Word2Vec | 100–300 | Word-level semantics | Word similarity, analogical reasoning |
| BERT Embedding | 768 | Contextual semantics | Sentence understanding, QA |
| OpenAI text-embedding-3 | 1536–3072 | Deep semantics | RAG, semantic search |
2. Similarity Calculation: How "Close" Are Two Vectors?
With vector representations in hand, the next question is naturally: how do we measure how similar two vectors are? It's like measuring how close two cities are on a map — you can measure the straight-line distance, or you can check whether they point in the same direction.
Vector Similarity Calculator
Drag vector endpoints to observe similarity metrics in real time
💡Cosine similarityfocuses only on direction and is useful for semantic text comparison; Euclidean distanceconsiders both direction and magnitude and fits absolute-distance scenarios.
Two Core Metrics
- Cosine Similarity: Measures whether two vectors point in the same direction. Range: [-1, 1]. 1 means identical direction, 0 means orthogonal (unrelated), -1 means completely opposite. The go-to choice for text semantic comparison because it's unaffected by vector magnitude.
- Euclidean Distance: Measures the straight-line distance between the endpoints of two vectors. Range: [0, ∞). 0 means they coincide exactly; the larger the value, the less similar. Suitable for scenarios where "absolute magnitude" matters.
| Metric | Formula Intuition | Range | Use Cases |
|---|---|---|---|
| Cosine Similarity | Compares direction, ignores magnitude | [-1, 1] | Text semantic search, recommendation systems |
| Euclidean Distance | Straight-line distance between endpoints | [0, ∞) | Image features, cluster analysis |
| Dot Product | Direction × Magnitude | (-∞, +∞) | Fast computation with normalized vectors |
| Manhattan Distance | Distance traveled along coordinate axes | [0, ∞) | High-dimensional sparse vectors |
3. Vector Indexing: How to Search Millions of Vectors in Milliseconds?
Suppose you have 1 million documents, each converted into a 1536-dimensional vector. A user asks a question, and you need to find the 10 most similar ones. The most direct approach is to compute similarity one by one — but that means performing 1 million 1536-dimensional vector operations, which is far too slow.
This is the problem that vector indexing solves: trade space for time by building index structures through preprocessing, reducing retrieval speed from O(n) to approximately O(log n).
Vector Index Strategy Comparison
Compare brute-force search with approximate nearest neighbor search
| Strategy | Time complexity | Accuracy | Use case |
|---|---|---|---|
| Brute force | O(n) | 100% | Small datasets (<10K) |
| ANN (IVF) | O(n/k) | ~95% | Large datasets (>100K) |
| HNSW | O(log n) | ~98% | High-performance retrieval |
Brute-Force Search vs. Approximate Nearest Neighbors (ANN)
- Brute-force (Flat): Compare one by one. 100% accurate but slow. Suitable for small datasets (< 100k).
- IVF (Inverted File Index): Partition the vector space into regions (clusters) first, then search only the nearest few regions during querying. Like organizing a library by subject — you only go to the relevant sections when looking for a book.
- HNSW (Hierarchical Navigable Small World): Build a multi-layer graph structure, navigating from coarse to fine granularity layer by layer. Like first looking at a world map to locate a country, then a provincial map, and finally a street map.
- PQ (Product Quantization): Compress high-dimensional vectors into short codes, sacrificing a small amount of precision for significant memory savings. Suitable for ultra-large-scale datasets.
| Index Type | Build Speed | Query Speed | Recall | Memory Usage | Suitable Scale |
|---|---|---|---|---|---|
| Flat (Brute-force) | No build needed | Slow | 100% | High | < 100k |
| IVF | Moderate | Fast | 95%+ | Medium | 100k–10M |
| HNSW | Slow | Very fast | 99%+ | High | 100k–10M |
| PQ | Moderate | Fast | 90%+ | Very low | > 10M |
| IVF-PQ | Moderate | Fast | 92%+ | Low | > 100M |
4. Vector Databases: Storage Engines Purpose-Built for Vectors
With vectors and indexing algorithms in hand, you need a place to store and manage them. Traditional databases (MySQL, PostgreSQL) excel at structured data but struggle with high-dimensional vector similarity search. Vector databases are designed specifically for this scenario.
Mainstream Vector Database Comparison
Click a card to see details and compare use cases across vector databases
Scenario recommendations
Core Capabilities of Vector Databases
- Efficient storage: Storage formats optimized for high-dimensional floating-point vectors
- ANN retrieval: Built-in support for multiple approximate nearest neighbor indexing algorithms (HNSW, IVF, etc.)
- Metadata filtering: Support for filtering by tags, timestamps, and other criteria alongside vector search
- Real-time updates: Support for dynamic insertion, deletion, and modification of vectors without rebuilding the entire index
- Horizontal scaling: Distributed architecture supporting billion-scale vector collections
| Database | Type | Characteristics | Use Cases |
|---|---|---|---|
| Pinecone | Fully-managed cloud service | Zero ops, works out of the box | Rapid prototyping, small-to-medium production |
| Milvus | Open-source distributed | High performance, scalable | Large-scale production environments |
| Chroma | Open-source lightweight | Embeddable, clean API | Local development, small projects |
| Weaviate | Open-source cloud-native | Built-in vectorization, GraphQL | Scenarios requiring automatic vectorization |
| Qdrant | Open-source high-performance | Rust implementation, strong filtering | Scenarios requiring complex filtering |
| pgvector | PostgreSQL extension | Reuses existing PG infrastructure | Teams already using PostgreSQL |
5. End-to-End Pipeline: The Complete Flow from Text to Retrieval
Now that we understand each component, let's wire them together and see how a complete vector retrieval system works.
The entire flow splits into two tracks: offline ingestion (turning documents into vectors and storing them) and online querying (turning questions into vectors and searching).
Embedding Generation Pipeline
Step through the full conversion from text to vector
Offline Ingestion Pipeline
- Document loading: Read raw text from various sources (PDFs, web pages, databases)
- Text preprocessing: Clean, denoise, and normalize (strip HTML tags, special characters, etc.)
- Text chunking: Split long text into appropriately sized chunks according to a strategy (200–500 tokens)
- Vectorization: Call an embedding model (e.g., OpenAI text-embedding-3-small) to convert each chunk into a vector
- Store in vector database: Write the vectors along with the original text and metadata into the database
Online Query Pipeline
- Receive query: User inputs a natural language question
- Query vectorization: Convert the question into a vector using the same embedding model
- Similarity retrieval: Search the vector database for the Top-K most similar document chunks
- Post-processing: Rerank, deduplicate, filter by metadata
- Return results: Return the most relevant document chunks to the caller (or pass them to an LLM to generate an answer)
| Stage | Key Decision | Recommended Approach |
|---|---|---|
| Embedding Model | Accuracy vs. cost vs. speed | OpenAI text-embedding-3-small (best value) |
| Chunking Strategy | Granularity vs. semantic integrity | Recursive chunking, 200–500 tokens |
| Vector Database | Scale vs. operational cost | Chroma for small projects, Pinecone/Milvus for production |
| Similarity Metric | Semantic vs. exact | Cosine similarity (go-to for text scenarios) |
| Top-K Value | Recall vs. noise | Retrieve 20 first, rerank and take Top 5 |
Summary
Embedding and vector retrieval form the bridge connecting "human language" and "machine understanding." They are also the foundational infrastructure for AI applications such as RAG, semantic search, and recommendation systems.
Key takeaways from this chapter:
- The essence of Embedding: Mapping text into a high-dimensional vector space, turning semantic similarity into spatial distance
- Similarity metrics: Cosine similarity focuses on direction (ideal for text), Euclidean distance focuses on absolute distance
- Indexing is the performance key: HNSW and IVF reduce retrieval across millions of vectors to milliseconds
- Vector database selection: Chroma/pgvector for small projects, Pinecone/Milvus for production
- End-to-end thinking: From document loading to final retrieval, every stage's choices affect the final outcome
Further Reading
- OpenAI Embeddings Documentation - Official embedding model usage guide
- Pinecone Learning Center - Systematic tutorials on vector databases and retrieval
- FAISS Wiki - Documentation for Facebook's open-source vector retrieval library
- Word2Vec Original Paper - The seminal work that launched the Embedding era
- MTEB Leaderboard - Embedding model performance comparison leaderboard