Lesson 4 of 6 · 10 min read

RAG with AI SDK

Retrieval-Augmented Generation (RAG) connects LLMs with your own data: knowledge bases, documents, databases. The AI SDK provides the embed and embedMany functions for generating embeddings. The resulting vectors are plain arrays of numbers, so you can store and query them in the vector store of your choice.

Embedding Generation

What Are Embeddings?

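Embeddings are numerical representations of text in a high-dimensional vector space. Semantically similar texts have similar vectors, which is what makes semantic search possible: you compare the vector of a query against the vectors of your documents instead of matching keywords.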
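To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure most commonly used to compare embeddings. The three-dimensional vectors are made-up toy values standing in for real embeddings, which have hundreds or thousands of dimensions.

```typescript
// Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as "not similar".
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Toy vectors (illustrative values only, not real model output).
const kubernetes = [0.9, 0.1, 0.0]
const containers = [0.8, 0.2, 0.1]
const baking = [0.0, 0.1, 0.9]

const related = cosineSimilarity(kubernetes, containers)
const unrelated = cosineSimilarity(kubernetes, baking)
console.log(related > unrelated) // true
```

In practice you rarely need to write this yourself: the ai package exports a cosineSimilarity helper, and vector databases compute the score for you.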

Embeddings with the AI SDK

import { embed, embedMany } from 'ai'
import { openai } from '@ai-sdk/openai'

// Single embedding
const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-large'),
  value: 'What is Kubernetes?',
})

// Batch embeddings
const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-large'),
  values: ['Document 1...', 'Document 2...', 'Document 3...'],
})

Embedding Models 2026

| Model | Dimensions | Strength |
| --- | --- | --- |
| text-embedding-3-large (OpenAI) | 3,072 | Best quality, versatile |
| text-embedding-3-small (OpenAI) | 1,536 | Good price-performance ratio |
| voyage-3-large (Voyage AI) | 1,024 | Strong for code and technical texts |
| multilingual-e5-large (open source) | 1,024 | Multilingual, self-hosted possible |

Vector Store Integration

Supported Vector Stores

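The AI SDK does not ship its own vector database. Because embeddings are plain arrays of numbers, you can store them in any of the common vector stores through that store's own client library: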

  • Pinecone: Managed, serverless, scales automatically
  • Supabase pgvector: PostgreSQL extension, ideal for Supabase users
  • Weaviate: Open source, hybrid search (vector + keyword)
  • Qdrant: Open source, high performance, Rust-based
  • ChromaDB: Simple, good for prototyping

RAG Pipeline with Supabase pgvector

1. Index documents:

import { embedMany } from 'ai'
import { openai } from '@ai-sdk/openai'
import { supabase } from '@/lib/supabase'

async function indexDocuments(documents: string[]) {
  // embedMany returns the embeddings in the same order as the input values
  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-large'),
    values: documents,
  })

  // Build all rows first, then insert them in a single request
  const rows = documents.map((content, i) => ({
    content,
    embedding: embeddings[i],
  }))

  const { error } = await supabase.from('documents').insert(rows)
  if (error) throw error
}

2. Retrieve relevant documents:

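import { embed } from 'ai'
import { openai } from '@ai-sdk/openai'
import { supabase } from '@/lib/supabase'

async function findRelevantDocs(query: string) {
  // Embed the query with the same model that was used for indexing
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-large'),
    value: query,
  })

  // match_documents is a Postgres function that must exist in your
  // database (not shown here); supabase.rpc only calls it
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: 5,
  })
  if (error) throw error

  return data ?? []
}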
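The lesson does not show the match_documents database function, so the following is only an in-memory sketch of what such a function is typically written to do: score every stored document against the query by cosine similarity, drop anything at or below match_threshold, and return the best match_count results. The names matchDocuments, StoredDoc, and Match are hypothetical and exist only for this illustration.

```typescript
// Hypothetical in-memory stand-in for a match_documents database function.
interface StoredDoc {
  content: string
  embedding: number[]
}

interface Match {
  content: string
  similarity: number
}

function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

function matchDocuments(
  queryEmbedding: number[],
  docs: StoredDoc[],
  matchThreshold: number,
  matchCount: number,
): Match[] {
  return docs
    .map(doc => ({
      content: doc.content,
      similarity: cosine(queryEmbedding, doc.embedding),
    }))
    .filter(match => match.similarity > matchThreshold) // drop weak matches
    .sort((x, y) => y.similarity - x.similarity) // best match first
    .slice(0, matchCount) // keep at most matchCount results
}

// Toy two-dimensional embeddings (made-up values).
const store: StoredDoc[] = [
  { content: 'Kubernetes basics', embedding: [1, 0] },
  { content: 'Docker containers', embedding: [0.9, 0.1] },
  { content: 'Sourdough bread', embedding: [0, 1] },
]

const matches = matchDocuments([1, 0], store, 0.7, 5)
console.log(matches.map(m => m.content))
```

A real vector database does the same work with an index instead of scanning every row, which is why the comparison is pushed into the database rather than done in application code.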

3. Pass context to the LLM:

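import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

// findRelevantDocs is the retrieval function from step 2
export async function POST(req: Request) {
  const { messages } = await req.json()
  // Use the latest user message as the search query
  const lastMessage = messages[messages.length - 1]

  const relevantDocs = await findRelevantDocs(lastMessage.content)
  const context = relevantDocs
    .map((d: { content: string }) => d.content)
    .join('\n\n')

  const result = streamText({
    model: openai('gpt-4.1'),
    system: `Answer questions based on this context:\n\n${context}`,
    messages,
  })

  // AI SDK 4 helper; later major versions may name this differently
  return result.toDataStreamResponse()
}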

Context Window Management

The Core Problem

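LLMs have a limited context window (roughly 128K to 2M tokens, depending on the model). Effective RAG must pack the most relevant information into this window without overloading it: every irrelevant chunk costs tokens and dilutes the signal the model has to work with.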
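One simple way to respect the window is to give retrieved context a fixed token budget and fill it greedily, most relevant document first. The sketch below assumes a rough heuristic of about four characters per token for English text; a real system would use the tokenizer of the target model. The names packContext, estimateTokens, and RetrievedDoc are hypothetical.

```typescript
interface RetrievedDoc {
  content: string
  similarity: number
}

// Assumption: ~4 characters per token. Replace with a real tokenizer in production.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Greedily pack the most similar documents into a token budget.
// The tokens used by the separators between documents are ignored here.
function packContext(docs: RetrievedDoc[], maxTokens: number): string {
  const sorted = [...docs].sort((a, b) => b.similarity - a.similarity)
  const selected: string[] = []
  let used = 0

  for (const doc of sorted) {
    const cost = estimateTokens(doc.content)
    // Skip documents that do not fit; a smaller one further down may still fit.
    if (used + cost > maxTokens) continue
    selected.push(doc.content)
    used += cost
  }

  return selected.join('\n\n')
}

const packed = packContext(
  [
    { content: 'a'.repeat(40), similarity: 0.9 }, // ~10 tokens
    { content: 'b'.repeat(400), similarity: 0.8 }, // ~100 tokens, too large
    { content: 'c'.repeat(20), similarity: 0.7 }, // ~5 tokens
  ],
  20,
)
console.log(packed.length) // 62: 40 + 2 separator characters + 20
```

Skipping an oversized document instead of stopping at it keeps the budget well used, at the cost of sometimes including a less relevant but shorter document.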

Strategies

| Strategy | Description | When |
| --- | --- | --- |
| Top-K retrieval | The K most similar documents | Standard |
| Reranking | Re-sort results with a reranker model | Higher quality |
| Chunking | Split documents into smaller pieces | Long documents |
| Hierarchical | Search coarsely first, then in detail | Large knowledge bases |
| Hybrid search | Combine vector + keyword + filters | Complex queries |
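Hybrid search needs a way to merge a vector ranking and a keyword ranking into one list. The lesson does not prescribe a method; one widely used option is reciprocal rank fusion (RRF), sketched below. Each document earns 1 / (k + rank) from every list it appears in, and the constant k = 60 is a conventional default, not a value taken from this lesson. The function name and the document IDs are hypothetical.

```typescript
// Reciprocal rank fusion: merge several ranked lists of document IDs.
// Documents that rank well in more than one list rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>()

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1 // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }

  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id)
}

// Hypothetical results from a vector search and a keyword search.
const vectorResults = ['a', 'b', 'c']
const keywordResults = ['b', 'd', 'a']

const fused = reciprocalRankFusion([vectorResults, keywordResults])
console.log(fused) // ['b', 'a', 'd', 'c']
```

RRF only looks at ranks, so it works even when the two searches produce scores on incompatible scales.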

Chunking Best Practices

  • Chunk size: 500–1,000 tokens per chunk (too small = context lost, too large = noise)
  • Overlap: 10–20% overlap between chunks (avoid context breaks)
  • Semantic chunking: Split at natural boundaries (paragraphs, chapters) — not mid-sentence
  • Metadata: Each chunk stores source, chapter, and position for citations
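The practices above can be sketched as a small chunker. It splits text at sentence boundaries, packs sentences into chunks up to a size limit, carries the last sentences of each chunk into the next one as overlap, and attaches source and position metadata. Two assumptions to note: word counts stand in for token counts, and the sentence splitter is deliberately naive (it will split on abbreviations such as "e.g."). The names chunkText and Chunk are hypothetical.

```typescript
interface Chunk {
  content: string
  source: string // where the chunk came from, for citations
  position: number // index of the chunk within its source
}

function countWords(text: string): number {
  return text.split(/\s+/).filter(word => word.length > 0).length
}

// Assumption: words approximate tokens. Defaults give roughly 15% overlap.
function chunkText(
  text: string,
  source: string,
  maxWords = 500,
  overlapWords = 75,
): Chunk[] {
  // Naive sentence split: a run of text followed by its terminating punctuation.
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map(sentence => sentence.trim())
    .filter(sentence => sentence.length > 0)

  const chunks: Chunk[] = []
  let current: string[] = []
  let currentWords = 0

  for (const sentence of sentences) {
    const words = countWords(sentence)

    if (current.length > 0 && currentWords + words > maxWords) {
      chunks.push({ content: current.join(' '), source, position: chunks.length })

      // Carry trailing sentences into the next chunk as overlap.
      const overlap: string[] = []
      let overlapCount = 0
      for (let i = current.length - 1; i >= 0; i--) {
        const w = countWords(current[i])
        if (overlapCount + w > overlapWords) break
        overlap.unshift(current[i])
        overlapCount += w
      }
      current = overlap
      currentWords = overlapCount
    }

    current.push(sentence)
    currentWords += words
  }

  if (current.length > 0) {
    chunks.push({ content: current.join(' '), source, position: chunks.length })
  }
  return chunks
}

// Tiny limits so the behavior is visible on a short text.
const demo = chunkText('A b c. D e f. G h i. J k l.', 'demo.txt', 6, 3)
console.log(demo.map(chunk => chunk.content))
// ['A b c. D e f.', 'D e f. G h i.', 'G h i. J k l.']
```

A sentence longer than the limit is kept whole rather than cut, which favors readable chunks over a strict size guarantee.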

RAG reality: As a rule of thumb, the quality of your RAG system depends about 80% on data preparation (chunking, cleaning, metadata) and only about 20% on the model. Invest in your data pipeline.