Skip to content

Embedding & Vector Retrieval

Preface

How does a computer understand that "cats and dogs are similar, but neither is similar to a car"? To humans this is common sense, but to a computer, "cat", "dog", and "car" are just three unrelated strings. Embedding technology is the key to solving this problem — it turns text into numerical vectors, allowing computers to understand semantic "closeness and distance."

What will you learn in this article?

After completing this chapter, you will gain:

  • Intuitive understanding: Grasp what Embedding is and why the vectors for "cat" and "dog" end up close together
  • Similarity calculation: Master core metrics like cosine similarity and Euclidean distance
  • Indexing principles: Understand how vector databases perform millisecond-level retrieval across millions of data points
  • Technology selection: Learn the characteristics and use cases of mainstream vector databases
  • End-to-end pipeline: Master the complete pipeline from text to vectors to retrieval
ChapterContentCore Concepts
Chapter 1Embedding ConceptsSemantic space, vector representation
Chapter 2Similarity CalculationCosine similarity, Euclidean distance
Chapter 3Vector IndexingBrute-force search vs. ANN
Chapter 4Vector DatabasesPinecone, Milvus, Chroma
Chapter 5End-to-End PipelineText → Vector → Storage → Query

0. Panorama: The Bridge from Text to Numbers

In the world of natural language processing, there is a fundamental challenge: computers only understand numbers, not text.

Early approaches assigned each word a numeric ID (One-Hot encoding), for example "cat"=001, "dog"=010, "car"=100. But this had a fatal flaw: all words are equally distant from each other. The distance from "cat" to "dog" is exactly the same as the distance from "cat" to "car" — which clearly defies our intuition.

The revolutionary insight of Embedding is this: it maps each word into a dense, low-dimensional vector space, where semantically similar words naturally cluster together. In this space, "cat" and "dog" are close neighbors, while "car" sits far away — the computer can finally "understand" semantics.

The Leap from One-Hot to Embedding

  • One-Hot: Dimension = vocabulary size (potentially tens of thousands), each vector has only a single 1 and all other entries are 0 — sparse and devoid of semantics
  • Embedding: Dimensions typically 768–1536, every number carries meaning — dense and rich in semantic information
  • Key breakthrough: Word2Vec (2013) proved that "a word's meaning can be defined by its context," ushering in the Embedding era

1. Embedding Concepts: Turning Text into Coordinates

The core idea of Embedding can be summarized in one sentence: use a set of numbers (a vector) to represent the meaning of a word or sentence.

Imagine a two-dimensional coordinate system. Place "cat" at coordinates (0.2, 0.7), "dog" at (0.3, 0.6), and "car" at (0.9, 0.1). You'll notice "cat" and "dog" have similar coordinates, while "car" is far away from both. This is the intuition behind Embedding — semantic similarity becomes spatial distance.

Word Embedding Space Visualization

Semantically similar words stay closer in vector space and form natural clusters

Dimension 1Dimension 2CatDogTigerRabbitKingQueenPrincePrincessAnimalsRoyalty

💡 Embedding models map text into high-dimensional vector spaces, often 768 to 1536 dimensions. This demo simplifies that into 2D to show the core idea: semantically similar words have shorter vector distances

Three Key Properties of Embeddings

  1. Semantic clustering: Words with similar meanings automatically group together (animals in one cluster, food in another, technology in another)
  2. Analogical relationships: Vector arithmetic can express semantic relationships — the classic example: king − man + woman ≈ queen
  3. Dimensional meaning: Each dimension implicitly encodes some semantic feature (e.g., "is it an animal", "size", "sentiment", etc.)
Encoding MethodDimensionsSemantic InformationTypical Application
One-HotVocabulary size (~50,000)NoneTraditional NLP
Word2Vec100–300Word-level semanticsWord similarity, analogical reasoning
BERT Embedding768Contextual semanticsSentence understanding, QA
OpenAI text-embedding-31536–3072Deep semanticsRAG, semantic search

2. Similarity Calculation: How "Close" Are Two Vectors?

With vector representations in hand, the next question is naturally: how do we measure how similar two vectors are? It's like measuring how close two cities are on a map — you can measure the straight-line distance, or you can check whether they point in the same direction.

Vector Similarity Calculator

Drag vector endpoints to observe similarity metrics in real time

AB
Cosine similarity
0.3969
-1 (opposite) ~ 1 (same)
Euclidean distance
1.71
0 (identical) ~ ∞ (far away)
Dot product
0.96
dot(A, B) = |A||B|cosθ

💡Cosine similarityfocuses only on direction and is useful for semantic text comparison; Euclidean distanceconsiders both direction and magnitude and fits absolute-distance scenarios.

Two Core Metrics

  • Cosine Similarity: Measures whether two vectors point in the same direction. Range: [-1, 1]. 1 means identical direction, 0 means orthogonal (unrelated), -1 means completely opposite. The go-to choice for text semantic comparison because it's unaffected by vector magnitude.
  • Euclidean Distance: Measures the straight-line distance between the endpoints of two vectors. Range: [0, ∞). 0 means they coincide exactly; the larger the value, the less similar. Suitable for scenarios where "absolute magnitude" matters.
MetricFormula IntuitionRangeUse Cases
Cosine SimilarityCompares direction, ignores magnitude[-1, 1]Text semantic search, recommendation systems
Euclidean DistanceStraight-line distance between endpoints[0, ∞)Image features, cluster analysis
Dot ProductDirection × Magnitude(-∞, +∞)Fast computation with normalized vectors
Manhattan DistanceDistance traveled along coordinate axes[0, ∞)High-dimensional sparse vectors

3. Vector Indexing: How to Search Millions of Vectors in Milliseconds?

Suppose you have 1 million documents, each converted into a 1536-dimensional vector. A user asks a question, and you need to find the 10 most similar ones. The most direct approach is to compute similarity one by one — but that means performing 1 million 1536-dimensional vector operations, which is far too slow.

This is the problem that vector indexing solves: trade space for time by building index structures through preprocessing, reducing retrieval speed from O(n) to approximately O(log n).

Vector Index Strategy Comparison

Compare brute-force search with approximate nearest neighbor search

Query point
Total points
60
Visited nodes
0
Search effort
0%
Nearest K found
0
StrategyTime complexityAccuracyUse case
Brute forceO(n)100%Small datasets (<10K)
ANN (IVF)O(n/k)~95%Large datasets (>100K)
HNSWO(log n)~98%High-performance retrieval

Brute-Force Search vs. Approximate Nearest Neighbors (ANN)

  • Brute-force (Flat): Compare one by one. 100% accurate but slow. Suitable for small datasets (< 100k).
  • IVF (Inverted File Index): Partition the vector space into regions (clusters) first, then search only the nearest few regions during querying. Like organizing a library by subject — you only go to the relevant sections when looking for a book.
  • HNSW (Hierarchical Navigable Small World): Build a multi-layer graph structure, navigating from coarse to fine granularity layer by layer. Like first looking at a world map to locate a country, then a provincial map, and finally a street map.
  • PQ (Product Quantization): Compress high-dimensional vectors into short codes, sacrificing a small amount of precision for significant memory savings. Suitable for ultra-large-scale datasets.
Index TypeBuild SpeedQuery SpeedRecallMemory UsageSuitable Scale
Flat (Brute-force)No build neededSlow100%High< 100k
IVFModerateFast95%+Medium100k–10M
HNSWSlowVery fast99%+High100k–10M
PQModerateFast90%+Very low> 10M
IVF-PQModerateFast92%+Low> 100M

4. Vector Databases: Storage Engines Purpose-Built for Vectors

With vectors and indexing algorithms in hand, you need a place to store and manage them. Traditional databases (MySQL, PostgreSQL) excel at structured data but struggle with high-dimensional vector similarity search. Vector databases are designed specifically for this scenario.

Mainstream Vector Database Comparison

Click a card to see details and compare use cases across vector databases

P
Pinecone
Managed cloud service
Cloud nativeServerless
Performance
Ease of use
Scalability
M
Milvus
Open-source distributed
Open sourceDistributedHigh performance
Performance
Ease of use
Scalability
W
Weaviate
Open-source AI native
Open sourceGraphQLModular
Performance
Ease of use
Scalability
C
Chroma
Lightweight embedded
Open sourceLightweightPython
Performance
Ease of use
Scalability
pg
pgvector
PostgreSQL extension
SQLPostgreSQLExtension
Performance
Ease of use
Scalability
Scenario recommendations
&#x1F680;
Fast prototype
Chroma / Pinecone
&#x1F3E2;
Enterprise deployment
Milvus / Weaviate
&#x1F4BE;
Existing PG database
pgvector
&#x1F916;
RAG app
Chroma / Weaviate

Core Capabilities of Vector Databases

  1. Efficient storage: Storage formats optimized for high-dimensional floating-point vectors
  2. ANN retrieval: Built-in support for multiple approximate nearest neighbor indexing algorithms (HNSW, IVF, etc.)
  3. Metadata filtering: Support for filtering by tags, timestamps, and other criteria alongside vector search
  4. Real-time updates: Support for dynamic insertion, deletion, and modification of vectors without rebuilding the entire index
  5. Horizontal scaling: Distributed architecture supporting billion-scale vector collections
DatabaseTypeCharacteristicsUse Cases
PineconeFully-managed cloud serviceZero ops, works out of the boxRapid prototyping, small-to-medium production
MilvusOpen-source distributedHigh performance, scalableLarge-scale production environments
ChromaOpen-source lightweightEmbeddable, clean APILocal development, small projects
WeaviateOpen-source cloud-nativeBuilt-in vectorization, GraphQLScenarios requiring automatic vectorization
QdrantOpen-source high-performanceRust implementation, strong filteringScenarios requiring complex filtering
pgvectorPostgreSQL extensionReuses existing PG infrastructureTeams already using PostgreSQL

5. End-to-End Pipeline: The Complete Flow from Text to Retrieval

Now that we understand each component, let's wire them together and see how a complete vector retrieval system works.

The entire flow splits into two tracks: offline ingestion (turning documents into vectors and storing them) and online querying (turning questions into vectors and searching).

Embedding Generation Pipeline

Step through the full conversion from text to vector

1
Tokenize
2
Encode
3
Model inference
4
Pooling
5
Normalize

Offline Ingestion Pipeline

  1. Document loading: Read raw text from various sources (PDFs, web pages, databases)
  2. Text preprocessing: Clean, denoise, and normalize (strip HTML tags, special characters, etc.)
  3. Text chunking: Split long text into appropriately sized chunks according to a strategy (200–500 tokens)
  4. Vectorization: Call an embedding model (e.g., OpenAI text-embedding-3-small) to convert each chunk into a vector
  5. Store in vector database: Write the vectors along with the original text and metadata into the database

Online Query Pipeline

  1. Receive query: User inputs a natural language question
  2. Query vectorization: Convert the question into a vector using the same embedding model
  3. Similarity retrieval: Search the vector database for the Top-K most similar document chunks
  4. Post-processing: Rerank, deduplicate, filter by metadata
  5. Return results: Return the most relevant document chunks to the caller (or pass them to an LLM to generate an answer)
StageKey DecisionRecommended Approach
Embedding ModelAccuracy vs. cost vs. speedOpenAI text-embedding-3-small (best value)
Chunking StrategyGranularity vs. semantic integrityRecursive chunking, 200–500 tokens
Vector DatabaseScale vs. operational costChroma for small projects, Pinecone/Milvus for production
Similarity MetricSemantic vs. exactCosine similarity (go-to for text scenarios)
Top-K ValueRecall vs. noiseRetrieve 20 first, rerank and take Top 5

Summary

Embedding and vector retrieval form the bridge connecting "human language" and "machine understanding." They are also the foundational infrastructure for AI applications such as RAG, semantic search, and recommendation systems.

Key takeaways from this chapter:

  1. The essence of Embedding: Mapping text into a high-dimensional vector space, turning semantic similarity into spatial distance
  2. Similarity metrics: Cosine similarity focuses on direction (ideal for text), Euclidean distance focuses on absolute distance
  3. Indexing is the performance key: HNSW and IVF reduce retrieval across millions of vectors to milliseconds
  4. Vector database selection: Chroma/pgvector for small projects, Pinecone/Milvus for production
  5. End-to-end thinking: From document loading to final retrieval, every stage's choices affect the final outcome

Further Reading