Lesson 1 of 6·11 min read

Understanding Vector Databases

Vector databases are the foundation of modern retrieval-augmented generation (RAG) systems. They store data as high-dimensional vectors (embeddings) and support fast similarity search, which is how AI models retrieve relevant passages from your company knowledge.

What Are Embeddings?

Embeddings are numerical representations of text, images, or other data in a high-dimensional space. Content with similar meaning maps to nearby vectors (the values below are illustrative):

"Artificial Intelligence" → [0.23, -0.45, 0.89, 0.12, ...]  (1536 dimensions)
"Machine Learning"        → [0.25, -0.42, 0.87, 0.15, ...]  (very similar!)
"Cake recipe"            → [-0.67, 0.33, -0.12, 0.78, ...]  (completely different)

Similarity Search

The core function of a vector database: Find the N most similar vectors to a query vector.
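A minimal sketch of what "find the N most similar vectors" means, using an exact brute-force scan in plain Python. The four-component vectors reuse the illustrative values from the embeddings example above, the query vector is made up, and the helper names `cosine_similarity` and `top_n` are hypothetical. A real vector database replaces this linear scan with an index.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (ranges from -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query, vectors, n):
    """Score every stored vector against the query and keep the n best."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    return [key for _, key in heapq.nlargest(n, scored)]

# Toy 4-dimensional "embeddings" (real ones have hundreds or thousands of dimensions).
documents = {
    "Artificial Intelligence": [0.23, -0.45, 0.89, 0.12],
    "Machine Learning": [0.25, -0.42, 0.87, 0.15],
    "Cake recipe": [-0.67, 0.33, -0.12, 0.78],
}

query = [0.24, -0.44, 0.88, 0.13]  # made-up query vector
nearest = top_n(query, documents, 2)
print(nearest)
```

The scan touches every stored vector, so its cost grows linearly with the collection size. That is the cost ANN indexes are designed to avoid.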

Distance Metrics

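| Metric | Description | When to use |
| --- | --- | --- |
| Cosine Similarity | Cosine of the angle between vectors (-1 to 1); ignores magnitude | Text embeddings (default) |
| Euclidean Distance | Straight-line (geometric) distance | When magnitude matters |
| Dot Product | Scalar product; equals cosine similarity for unit-length vectors | Normalized embeddings |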
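To make the differences between the three metrics concrete, here is a small sketch in plain Python. The vectors and function names are made up for illustration. It shows two properties that follow from the definitions: cosine similarity ignores vector length, and for unit-length vectors the dot product gives the same value as cosine similarity.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def normalize(a):
    """Scale a vector to unit length."""
    length = math.sqrt(dot_product(a, a))
    return [x / length for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]
a_doubled = [6.0, 8.0]     # same direction as a, twice the length
a_flipped = [-3.0, -4.0]   # opposite direction

# Same direction: cosine says "identical", Euclidean distance still sees a gap.
print(cosine_similarity(a, a_doubled), euclidean_distance(a, a_doubled))

# Opposite direction: cosine similarity reaches its lower bound.
print(cosine_similarity(a, a_flipped))

# On unit-length vectors the dot product matches cosine similarity.
print(dot_product(normalize(a), normalize(b)), cosine_similarity(a, b))
```

This is why the dot product is a reasonable choice once embeddings are normalized: it ranks results the same way as cosine similarity while skipping the two norm computations.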

ANN Algorithms

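Exact nearest-neighbor search compares the query against every stored vector, which becomes too slow with millions of vectors. Approximate Nearest Neighbor (ANN) algorithms trade a little recall for speed: they typically find 95-99% of the true nearest neighbors while examining only a small fraction of the data.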

HNSW (Hierarchical Navigable Small World)

Layer 3:  [ A ] ─────────────── [ B ]
Layer 2:  [ A ] ──── [ C ] ──── [ B ]
Layer 1:  [ A ] ── [D] ── [ C ] ── [E] ── [ B ]
Layer 0:  [A] [F] [D] [G] [C] [H] [E] [I] [B]

  • How it works: A layered graph built on the skip-list principle. Sparse upper layers allow long jumps toward the query, and the dense bottom layer refines the result
  • Strengths: Very fast queries, high recall
  • Weaknesses: High memory usage, slower index building
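The layered descent can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the full HNSW algorithm: it hard-codes the four layers from the diagram, places the nodes at made-up 1-D coordinates, and follows only the single best neighbor at each step. Real HNSW assigns layers randomly during insertion and keeps a list of several candidates while searching.

```python
import math

# Hypothetical 1-D positions for the nodes in the diagram, left to right.
POINTS = {"A": (0.0,), "F": (1.0,), "D": (2.0,), "G": (3.0,), "C": (4.0,),
          "H": (5.0,), "E": (6.0,), "I": (7.0,), "B": (8.0,)}

def chain(names):
    """Link each node to its left and right neighbor within one layer."""
    graph = {name: [] for name in names}
    for left, right in zip(names, names[1:]):
        graph[left].append(right)
        graph[right].append(left)
    return graph

# Layers from top (sparse, long links) to bottom (dense, short links).
LAYERS = [chain("AB"), chain("ACB"), chain("ADCEB"), chain("AFDGCHEIB")]

def greedy_search(query, entry="A"):
    """Walk greedily toward the query in each layer, then drop down one layer."""
    current = entry
    for graph in LAYERS:
        while True:
            best = min(graph[current], key=lambda n: math.dist(POINTS[n], query))
            if math.dist(POINTS[best], query) >= math.dist(POINTS[current], query):
                break  # no neighbor is closer: descend to the next layer
            current = best
    return current

result = greedy_search((5.2,))
print(result)
```

For the query at position 5.2 the walk visits A, B, C, E, and H: the upper layers cross the whole range in one or two hops, and only the last step happens in the dense bottom layer.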

IVF (Inverted File Index)

  • How it works: Vectors are grouped into clusters, and only the clusters nearest to the query are searched
  • Strengths: Low memory usage, fast index building
  • Weaknesses: Slightly lower recall, because a true neighbor can sit in a cluster that is not searched
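A minimal IVF sketch in plain Python, under simplifying assumptions: the two centroids are fixed by hand rather than learned with k-means, the data is 2-D, and the names `build_ivf`, `ivf_search`, and `nprobe` (the number of clusters to search) are chosen for illustration. The example is constructed so that searching a single cluster misses the true nearest neighbor, which is the recall trade-off named in the list above.

```python
import math

def build_ivf(vectors, centroids):
    """Assign every vector ID to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vec_id for c in by_distance[:nprobe] for vec_id in lists[c]]
    return sorted(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]))[:k]

centroids = [(0.0, 0.0), (10.0, 0.0)]  # hand-picked; normally learned with k-means
vectors = {
    "p0": (1.0, 0.0),
    "p1": (3.0, 0.0),
    "p2": (5.2, 0.0),  # just across the cluster boundary, belongs to centroid 1
    "p3": (9.0, 1.0),
}
lists = build_ivf(vectors, centroids)

query = (4.9, 0.0)  # closest centroid is 0, but the true nearest neighbor is p2
one_cluster = ivf_search(query, vectors, centroids, lists, nprobe=1)
two_clusters = ivf_search(query, vectors, centroids, lists, nprobe=2)
print(one_cluster, two_clusters)
```

Raising `nprobe` recovers the missed neighbor at the cost of scanning more vectors, so it acts as the dial between speed and recall.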

Vector Database Comparison

| Database | Type | Strengths | Weaknesses | Best for |
| --- | --- | --- | --- | --- |
| Pinecone | Managed cloud | Zero-ops, auto-scales | Vendor lock-in, cost | Teams without DB expertise |
| Weaviate | Open source | Hybrid search, GraphQL API | More complex setup | Hybrid search scenarios |
| Chroma | Open source | Easiest start, embedded | Not for large scale | Prototypes, small projects |
| Qdrant | Open source | Rust performance, filtering | Smaller ecosystem | Performance-critical apps |
| pgvector | PostgreSQL extension | Uses existing Postgres infra | Less specialized | Teams with PostgreSQL stack |

pgvector Example

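-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with a vector column; 1536 must match your embedding model's output size
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding VECTOR(1536)
);

-- HNSW index for fast cosine-distance search (requires pgvector 0.5.0 or later)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Similarity search: <=> is cosine distance, so 1 - distance is cosine similarity.
-- query_embedding is a placeholder; bind the query vector as a parameter from your application.
SELECT content, 1 - (embedding <=> query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 5;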
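In application code, the query vector has to be passed into that search statement as a parameter. The sketch below shows one way to prepare it in Python: pgvector accepts a vector in the text form `[x1,x2,...]`, so the embedding is serialized to that form and cast with `::vector`. The helper names are hypothetical, and the `%s` placeholder style is an assumption that matches drivers such as psycopg; other drivers use different placeholders.

```python
def to_vector_literal(embedding):
    """Serialize a sequence of numbers into pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Assumption: the driver uses %s placeholders (psycopg style).
SEARCH_SQL = """
SELECT content, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

def build_search(embedding, limit=5):
    """Return the SQL text and the parameter tuple for a similarity search."""
    literal = to_vector_literal(embedding)
    return SEARCH_SQL, (literal, literal, limit)

sql, params = build_search([0.1, 0.2], limit=3)
print(params)
```

With a psycopg connection this would typically be run as `cursor.execute(sql, params)`. Passing the vector as a bound parameter, rather than formatting it into the SQL string, keeps the statement safe from injection.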

Practical tip: For prototypes, use Chroma (easiest start) or pgvector (if you already have PostgreSQL). For production with >1M documents, evaluate Pinecone (managed) or Qdrant (self-hosted). The database choice matters less than the quality of your embeddings and chunks.

Quiz

Question 1 of 3

What is the main advantage of HNSW over exact nearest-neighbor search?