Understanding Vector Databases

Vector databases are the foundation of most modern RAG (retrieval-augmented generation) systems. They store data as high-dimensional vectors (embeddings) and support fast similarity search, which is how an AI model retrieves the relevant parts of your company knowledge.

What Are Embeddings?

Embeddings are numerical representations of text, images, or other data in a high-dimensional space. Similar content has similar vectors:

"Artificial Intelligence" → [0.23, -0.45, 0.89, 0.12, ...]  (1536 dimensions)
"Machine Learning"        → [0.25, -0.42, 0.87, 0.15, ...]  (very similar!)
"Cake recipe"            → [-0.67, 0.33, -0.12, 0.78, ...]  (completely different)

Similarity Search

The core operation of a vector database is to find the N stored vectors most similar to a query vector.
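
As a baseline, exact search can be sketched in a few lines of Python. This is only an illustration: the `top_n` helper, the document ids, and the 2-dimensional vectors are all made up, and Euclidean distance is used purely because the standard library provides it.

```python
import heapq
import math

def top_n(query, vectors, n):
    """Return the ids of the n stored vectors closest to `query`.

    `vectors` maps an id to a list of floats. Every stored vector is
    compared against the query, so the cost grows linearly with the
    size of the collection.
    """
    nearest = heapq.nsmallest(
        n, vectors.items(), key=lambda item: math.dist(query, item[1])
    )
    return [doc_id for doc_id, _ in nearest]

# Hypothetical 2-dimensional "embeddings" for illustration only.
store = {
    "ai": [0.9, 0.1],
    "ml": [0.8, 0.2],
    "cake": [-0.7, 0.6],
}
result = top_n([0.88, 0.1], store, n=2)
print(result)
```

A vector database answers the same question, but uses an index so that it does not have to touch every vector.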

Distance Metrics

Metric	Description	When to use
Cosine Similarity	Cosine of the angle between vectors (-1 to 1)	Text embeddings (default)
Euclidean Distance	Geometric distance	When magnitude matters
Dot Product	Sum of element-wise products	Normalized embeddings (same ranking as cosine)
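
The three metrics can be written out in plain Python as a rough sketch. The input vectors are only the first four components of the example embeddings shown earlier, so the scores are illustrative and not what a real 1536-dimensional model would produce. The function names are chosen for this example.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in the range -1 to 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

# Truncated example vectors (assumption: remaining dimensions ignored).
ai = [0.23, -0.45, 0.89, 0.12]
ml = [0.25, -0.42, 0.87, 0.15]
cake = [-0.67, 0.33, -0.12, 0.78]

sim_related = cosine_similarity(ai, ml)      # close to 1
sim_unrelated = cosine_similarity(ai, cake)  # below 0
print(sim_related, sim_unrelated)
print(euclidean_distance(ai, ml), euclidean_distance(ai, cake))
```

On unit-length vectors the dot product and cosine similarity give the same value, which is why the dot product is a common choice for embeddings that are already normalized.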

ANN Algorithms

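Exact nearest neighbor search compares the query against every stored vector, which becomes too slow with millions of vectors. Approximate Nearest Neighbor (ANN) algorithms trade a little accuracy for speed: depending on tuning, they typically return 95-99% of the true nearest neighbors (recall) at a fraction of the query time. Two widely used index types are HNSW and IVF.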
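
The accuracy of an ANN index is usually reported as recall: the share of the true nearest neighbors that the index actually returned. A small sketch of that calculation, with made-up document ids and a helper name chosen for this example:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors that the ANN index returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical ids: an exact search and an ANN index each return 5 results.
exact_top5 = ["d1", "d2", "d3", "d4", "d5"]
ann_top5 = ["d1", "d2", "d3", "d5", "d9"]
recall = recall_at_k(exact_top5, ann_top5)  # 4 of the 5 true neighbors found
print(recall)
```

In practice this is averaged over many test queries, with the exact results computed once by brute force.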

HNSW (Hierarchical Navigable Small World)

Layer 3:  [ A ] ─────────────── [ B ]
Layer 2:  [ A ] ──── [ C ] ──── [ B ]
Layer 1:  [ A ] ── [D] ── [ C ] ── [E] ── [ B ]
Layer 0:  [A] [F] [D] [G] [C] [H] [E] [I] [B]

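How it works: A multi-layer proximity graph built on the skip-list principle. Sparse upper layers cover long distances in a few hops; the search then descends to denser layers to refine the result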
Strengths: Very fast queries, good recall rate
Weaknesses: High memory usage, slower index building
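
The layered descent can be sketched with the nodes from the diagram. This is a deliberately simplified illustration, not the full algorithm: each node is placed on a number line instead of in a high-dimensional space, Layer 0 is assumed to link each node to its direct neighbors, and the search keeps a single candidate, whereas real HNSW implementations track a list of candidates and assign layers randomly during insertion.

```python
# Node positions on a line stand in for embeddings (an assumption made
# so that distances are easy to check by hand).
POSITION = {"A": 0, "F": 1, "D": 2, "G": 3, "C": 4, "H": 5, "E": 6, "I": 7, "B": 8}

def chain(nodes):
    """Link each node to its left and right neighbor within one layer."""
    links = {node: [] for node in nodes}
    for left, right in zip(nodes, nodes[1:]):
        links[left].append(right)
        links[right].append(left)
    return links

# Layers from top (sparse) to bottom (all nodes), mirroring the diagram.
LAYERS = [
    chain("AB"),
    chain("ACB"),
    chain("ADCEB"),
    chain("AFDGCHEIB"),
]

def greedy_search(query, entry="A"):
    """Descend layer by layer, moving to any neighbor closer to the query."""
    current = entry
    for links in LAYERS:
        improved = True
        while improved:
            improved = False
            for neighbor in links[current]:
                if abs(POSITION[neighbor] - query) < abs(POSITION[current] - query):
                    current = neighbor
                    improved = True
    return current

nearest = greedy_search(5.2)
print(nearest)
```

For the query 5.2 the search visits A, B, C, E, and H: the top layers make large jumps across the line, and the lower layers make the final small corrections.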

IVF (Inverted File Index)

How it works: Vectors are grouped into clusters; at query time only the clusters closest to the query are searched
Strengths: Low memory usage, fast index building
Weaknesses: Lower recall when true neighbors fall in clusters that are not searched
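
A minimal sketch of the IVF idea follows. The centroids are hard-coded here as an assumption; real systems learn them from the data, usually with k-means. The function names and the `nprobe` parameter (how many clusters to scan) are chosen for this example, although `nprobe` is a common name for this setting.

```python
import math

def nearest_centroids(vector, centroids, count):
    """Indices of the `count` centroids closest to `vector`."""
    order = sorted(
        range(len(centroids)), key=lambda i: math.dist(vector, centroids[i])
    )
    return order[:count]

def build_index(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for doc_id, vector in vectors.items():
        buckets[nearest_centroids(vector, centroids, 1)[0]].append(doc_id)
    return buckets

def search(query, vectors, centroids, buckets, n, nprobe=1):
    """Scan only the `nprobe` buckets whose centroids are closest to the query."""
    candidates = [
        doc_id
        for bucket in nearest_centroids(query, centroids, nprobe)
        for doc_id in buckets[bucket]
    ]
    candidates.sort(key=lambda doc_id: math.dist(query, vectors[doc_id]))
    return candidates[:n]

# Hypothetical 2-D data; the centroids would normally come from k-means.
vectors = {"a": [0.1, 0.2], "b": [0.3, 0.1], "c": [0.9, 0.8], "d": [0.45, 0.6]}
centroids = [[0.2, 0.2], [0.8, 0.8]]
buckets = build_index(vectors, centroids)

query = [0.4, 0.45]
fast = search(query, vectors, centroids, buckets, n=1, nprobe=1)
full = search(query, vectors, centroids, buckets, n=1, nprobe=2)
print(fast, full)
```

With `nprobe=1` the search returns "b", although "d" is the true nearest neighbor: "d" sits in the other cluster, close to the boundary. Scanning both clusters finds it, at the cost of more comparisons. This is the recall trade-off in miniature.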

Vector Database Comparison

Database	Type	Strengths	Weaknesses	Best for
Pinecone	Managed cloud	Zero-ops, auto-scales	Vendor lock-in, cost	Teams without DB expertise
Weaviate	Open source	Hybrid search, GraphQL API	More complex setup	Hybrid search scenarios
Chroma	Open source	Easiest start, embedded	Not for large scale	Prototypes, small projects
Qdrant	Open source	Rust performance, filtering	Smaller ecosystem	Performance-critical apps
pgvector	PostgreSQL extension	Uses existing Postgres infra	Less specialized	Teams with PostgreSQL stack

pgvector Example

-- Enable the extension (safe to run more than once)
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with vector column
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding VECTOR(1536)
);

-- HNSW index for approximate cosine-distance search (requires pgvector 0.5.0 or later)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Similarity search: $1 is the query embedding, passed in as a parameter.
-- <=> is cosine distance, so 1 - distance gives cosine similarity.
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
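
From application code, the query embedding is typically passed as a bound parameter. The sketch below shows one way this might look in Python. It assumes `conn` is an already open DB-API connection that uses `%s` placeholders, as psycopg does, and it sends the vector in pgvector's text form. The helper names `to_vector_literal` and `search_documents` are made up for this example.

```python
def to_vector_literal(values):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Same query as above, with %s placeholders instead of a literal vector.
SEARCH_SQL = """
SELECT content, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def search_documents(conn, query_embedding, limit=5):
    """Run the similarity search with the query vector bound as a parameter.

    `conn` is assumed to be an open DB-API connection that uses the
    `%s` placeholder style; it is not created here.
    """
    literal = to_vector_literal(query_embedding)
    cur = conn.cursor()
    try:
        cur.execute(SEARCH_SQL, (literal, literal, limit))
        return cur.fetchall()
    finally:
        cur.close()

print(to_vector_literal([0.1, 0.2, 0.3]))
```

The query embedding must come from the same embedding model, and have the same number of dimensions, as the vectors stored in the table.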

Practical tip: For prototypes, use Chroma (easiest start) or pgvector (if you already have PostgreSQL). For production with >1M documents, evaluate Pinecone (managed) or Qdrant (self-hosted). The database choice matters less than the quality of your embeddings and chunks.