RAG with AI SDK

Retrieval-Augmented Generation (RAG) connects LLMs with your own data: knowledge bases, documents, databases. The AI SDK provides the embed and embedMany functions for generating embeddings, and the resulting vectors are plain number arrays that can be stored in any vector database.

Embedding Generation

What Are Embeddings?

Embeddings are numerical representations of text in a high-dimensional vector space. Semantically similar texts map to vectors that lie close together (measured, for example, by cosine similarity), which is what makes semantic search possible.
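To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors are hypothetical toy values chosen for illustration, not real model output; real embeddings have hundreds or thousands of dimensions.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as "not similar"
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Toy vectors (hypothetical values, not produced by any model)
const kubernetes = [0.9, 0.1, 0.0]
const containers = [0.8, 0.2, 0.1]
const cooking = [0.0, 0.1, 0.9]

const related = cosineSimilarity(kubernetes, containers)
const unrelated = cosineSimilarity(kubernetes, cooking)
console.log({ related, unrelated })
```

The related pair scores close to 1 and the unrelated pair close to 0. The ai package also exports a cosineSimilarity helper, so in practice you would not need to write this yourself.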

Embeddings with the AI SDK

import { embed, embedMany } from 'ai'
import { openai } from '@ai-sdk/openai'

// Single embedding
const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-large'),
  value: 'What is Kubernetes?',
})

// Batch embeddings
const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-large'),
  values: ['Document 1...', 'Document 2...', 'Document 3...'],
})

Embedding Models 2026

Model	Dimensions	Strength
text-embedding-3-large (OpenAI)	3,072	Best quality, versatile
text-embedding-3-small (OpenAI)	1,536	Good price-performance ratio
voyage-3-large (Voyage AI)	1,024	Strong for code and technical texts
multilingual-e5-large (open source)	1,024	Multilingual, self-hosted possible

Vector Store Integration

Supported Vector Stores

Embeddings returned by the AI SDK are plain number arrays, so they work with any of the major vector databases through that database's own client:

Pinecone: Managed, serverless, scales automatically
Supabase pgvector: PostgreSQL extension, ideal for Supabase users
Weaviate: Open source, hybrid search (vector + keyword)
Qdrant: Open source, high performance, Rust-based
ChromaDB: Simple, good for prototyping

RAG Pipeline with Supabase pgvector

1. Index documents:

import { embedMany } from 'ai'
import { openai } from '@ai-sdk/openai'
import { supabase } from '@/lib/supabase'

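// Embeds all documents in one batch and stores each with its vector
async function indexDocuments(documents: string[]) {
  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-large'),
    values: documents,
  })

  // Build all rows first so they are written in a single insert
  // instead of one round trip per document
  const rows = documents.map((content, i) => ({
    content,
    embedding: embeddings[i],
  }))

  const { error } = await supabase.from('documents').insert(rows)
  if (error) throw error
}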

2. Retrieve relevant documents:

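import { embed } from 'ai'

async function findRelevantDocs(query: string) {
  // Embed the query with the same model that was used for indexing
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-large'),
    value: query,
  })

  // match_documents is a Postgres function that must already exist in your database
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: 5,
  })
  if (error) throw error

  // Return an empty list rather than null when nothing matches
  return data ?? []
}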
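The match_documents function itself is not shown above. As a rough mental model, and assuming it filters by cosine similarity above match_threshold and returns the match_count best rows, its behavior can be sketched in memory as follows. The names matchDocumentsInMemory, StoredDoc, and Match are hypothetical, and the two-dimensional vectors are toy values; a real setup would do this inside PostgreSQL with pgvector.

```typescript
interface StoredDoc {
  content: string
  embedding: number[]
}

interface Match {
  content: string
  similarity: number
}

// Cosine similarity of two equal-length vectors
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical in-memory stand-in for the match_documents RPC:
// score every document, drop those at or below the threshold,
// sort best-first, and keep at most matchCount results.
function matchDocumentsInMemory(
  queryEmbedding: number[],
  docs: StoredDoc[],
  matchThreshold: number,
  matchCount: number,
): Match[] {
  return docs
    .map(doc => ({
      content: doc.content,
      similarity: cosine(queryEmbedding, doc.embedding),
    }))
    .filter(match => match.similarity > matchThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount)
}

// Toy data (hypothetical 2-dimensional embeddings)
const toyDocs: StoredDoc[] = [
  { content: 'unrelated', embedding: [0, 1] },
  { content: 'close', embedding: [0.8, 0.6] },
  { content: 'exact', embedding: [1, 0] },
  { content: 'borderline', embedding: [0.6, 0.8] },
]

const matches = matchDocumentsInMemory([1, 0], toyDocs, 0.7, 5)
console.log(matches.map(m => m.content))
```

With a threshold of 0.7, only the "exact" and "close" documents survive, in that order. Scanning every row like this is fine for a sketch but does not scale; a vector index in the database is what makes the same query fast on large tables.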

3. Pass context to LLM:

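import { streamText } from 'ai'

export async function POST(req: Request) {
  const { messages } = await req.json()
  const lastMessage = messages[messages.length - 1]

  // Retrieve documents for the latest user message and join them into one context string
  const relevantDocs = (await findRelevantDocs(lastMessage.content)) ?? []
  const context = relevantDocs
    .map((d: { content: string }) => d.content)
    .join('\n\n')

  const result = streamText({
    model: openai('gpt-4.1'),
    system: `Answer questions based on this context:\n\n${context}`,
    messages,
  })

  // AI SDK 4 API; in AI SDK 5 the equivalent is toUIMessageStreamResponse()
  return result.toDataStreamResponse()
}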

Context Window Management

The Core Problem

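LLMs have a limited context window, roughly 128K to 2M tokens depending on the model. Effective RAG must pack the most relevant information into this window without overloading it, since every retrieved chunk adds cost, latency, and potential noise.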
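One simple way to avoid overloading the window is to cap the retrieved context at a fixed token budget. The sketch below is a minimal illustration under stated assumptions: estimateTokens and packContext are hypothetical helpers, and the estimate of about 4 characters per token is a rough heuristic for English text, not a real tokenizer.

```typescript
// Rough token estimate (assumption: about 4 characters per token).
// Use a real tokenizer if you need exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Takes documents already sorted best-first and keeps them in order
// until the next one would exceed the budget. The separators between
// documents are not counted toward the budget in this sketch.
function packContext(rankedDocs: string[], maxTokens: number): string {
  const selected: string[] = []
  let used = 0
  for (const doc of rankedDocs) {
    const cost = estimateTokens(doc)
    if (used + cost > maxTokens) break
    selected.push(doc)
    used += cost
  }
  return selected.join('\n\n')
}

// Three documents of about 10 estimated tokens each, with a budget of 25
const rankedDocs = ['a'.repeat(40), 'b'.repeat(40), 'c'.repeat(40)]
const packed = packContext(rankedDocs, 25)
console.log(packed.length)
```

Here only the first two documents fit. Stopping at the first document that does not fit preserves the ranking order; skipping it and trying smaller, lower-ranked documents is an alternative that fills the budget more tightly at the cost of including weaker matches.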

Strategies

Strategy	Description	When to use
Top-K retrieval	The K most similar documents	Standard
Reranking	Re-sort results with a reranker model	Higher quality
Chunking	Split documents into smaller pieces	Long documents
Hierarchical	Search coarsely first, then in detail	Large knowledge bases
Hybrid search	Combine vector + keyword + filters	Complex queries
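For the hybrid search row, the ranked lists from vector search and keyword search have to be merged somehow. The table does not say how; one widely used option is reciprocal rank fusion (RRF), sketched below. The function name and the document IDs are hypothetical, and k = 60 is a conventional default rather than a tuned value.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) to a
// document's score, where rank starts at 1. Documents that rank well
// in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  return Array.from(scores.entries())
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id)
}

// Hypothetical document IDs, each list sorted best-first
const vectorHits = ['a', 'b', 'c']
const keywordHits = ['b', 'd', 'a']

const fused = reciprocalRankFusion([vectorHits, keywordHits])
console.log(fused)
```

Documents "b" and "a" appear in both lists and therefore outrank "d" and "c", which appear in only one. RRF uses ranks only, so it needs no score normalization between the two retrievers; the fused list can then be truncated to the top K or passed on to a reranker.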

Chunking Best Practices

Chunk size: 500–1,000 tokens per chunk (too small = context lost, too large = noise)
Overlap: 10–20% overlap between chunks (avoid context breaks)
Semantic chunking: Split at natural boundaries (paragraphs, chapters) — not mid-sentence
Metadata: Each chunk stores source, chapter, and position for citations
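The size, overlap, and metadata points above can be sketched as a sliding-window chunker. This is a minimal illustration, not a production splitter: chunkText and the Chunk shape are hypothetical, words stand in for tokens as a rough proxy, and the window ignores paragraph boundaries, so it does not implement semantic chunking.

```typescript
interface Chunk {
  text: string
  source: string   // where the chunk came from, for citations
  position: number // index of the chunk within its source
}

// Splits text into windows of chunkSize words, where each window
// repeats the last `overlap` words of the previous one.
function chunkText(
  text: string,
  source: string,
  chunkSize = 500,
  overlap = 75, // 15% of the default chunk size
): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= overlap < chunkSize')
  }
  const words = text.split(/\s+/).filter(word => word.length > 0)
  const step = chunkSize - overlap
  const chunks: Chunk[] = []
  for (let start = 0; start < words.length; start += step) {
    const window = words.slice(start, start + chunkSize)
    chunks.push({ text: window.join(' '), source, position: chunks.length })
    // Stop once a window has reached the end of the text
    if (start + chunkSize >= words.length) break
  }
  return chunks
}

// 25 words, windows of 10 words with 2 words of overlap
const sampleText = Array.from({ length: 25 }, (_, i) => `w${i}`).join(' ')
const chunks = chunkText(sampleText, 'handbook.md', 10, 2)
console.log(chunks.map(chunk => chunk.text))
```

The first chunk ends with "w8 w9" and the second begins with the same two words, which is the overlap that keeps context from breaking at chunk borders. To respect natural boundaries, a common refinement is to split on paragraphs first and apply a window like this only to paragraphs that exceed the chunk size.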

RAG reality: As a rule of thumb, the quality of your RAG system depends far more on data preparation (chunking, cleaning, metadata) than on the model; a common heuristic puts the split at roughly 80/20. Invest in your data pipeline.