Multi-Modal & Agents

The Vercel AI SDK handles more than text: it also supports image and audio input and multi-step agent loops. This lesson shows how to build multimodal applications and autonomous AI agents.

Image & Audio Input

Sending Images to LLMs

Vision-capable models (GPT-4.1, Claude Sonnet 4, Gemini 2.5) can analyze images passed alongside text in a message:

import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const { text } = await generateText({
  model: openai('gpt-4.1'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image?' },
        { type: 'image', image: new URL('https://example.com/photo.jpg') },
      ],
    },
  ],
})
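Images do not have to come from a public URL. As a rough sketch, the helper below builds the same kind of content array from either a remote URL or raw bytes. The TextPart and ImagePart types and the buildVisionContent helper are hypothetical local definitions that mirror the shapes used above; they are not imports from the SDK.

```typescript
// Local stand-ins for the content-part shapes used in the example above.
// These are illustrative types, not imported from the 'ai' package.
type TextPart = { type: 'text'; text: string }
type ImagePart = { type: 'image'; image: URL | Uint8Array }
type ContentPart = TextPart | ImagePart

// Hypothetical helper: one text prompt followed by any number of images.
// Strings are treated as remote URLs; Uint8Array values are passed as raw bytes.
function buildVisionContent(prompt: string, images: Array<string | Uint8Array>): ContentPart[] {
  const imageParts: ImagePart[] = images.map((img) => ({
    type: 'image',
    image: typeof img === 'string' ? new URL(img) : img,
  }))
  return [{ type: 'text', text: prompt }, ...imageParts]
}

const content = buildVisionContent('What do you see in these images?', [
  'https://example.com/photo.jpg',
  new Uint8Array([0xff, 0xd8, 0xff]), // placeholder bytes, not a complete image
])
```

The resulting array could then be used as the content of a user message, as in the generateText call above.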

Use Cases for Vision

Product recognition: Photo → product name, category, price
Document analysis: Photograph invoice → extract structured data
UI review: Screenshot → accessibility and design feedback
Chart analysis: Diagram → data interpretation and summary

Audio Input

With models that support audio (e.g., Gemini 2.5):

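import { readFile } from 'node:fs/promises'
import { generateText } from 'ai'
import { google } from '@ai-sdk/google'

// Read the recording from disk; the path is a placeholder.
const audioBuffer = await readFile('./recording.mp3')

const { text } = await generateText({
  model: google('gemini-2.5-pro'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe and summarize this recording.' },
        // Audio is sent as a generic file part together with its MIME type
        { type: 'file', data: audioBuffer, mimeType: 'audio/mp3' },
      ],
    },
  ],
})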

Agent Loops

What Are AI Agents?

An agent is an LLM that autonomously makes decisions, calls tools, and iterates until a task is complete. In the Vercel AI SDK, you enable this loop by setting maxSteps, which caps how many LLM calls (steps) a single request may make:

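import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// webSearchTool, readPageTool, and saveNoteTool are assumed to be defined
// elsewhere; messages is the chat history received from the client.
const result = streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  system: 'You are a research agent. Use the available tools to thoroughly research questions.',
  messages,
  tools: {
    webSearch: webSearchTool,
    readPage: readPageTool,
    saveNote: saveNoteTool,
  },
  maxSteps: 10, // stop after at most 10 LLM calls
})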

Agent Flow

1. Reasoning: LLM analyzes the task
2. Tool selection: LLM chooses the appropriate tool
3. Execution: Tool is executed, result returned to LLM
4. Evaluation: LLM checks if the task is complete
5. Iteration: If not complete → back to step 1
6. Response: Final answer to the user
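The flow above can be sketched as a plain loop. This is a simplified illustration, not the SDK's internal implementation: the Model, Tool, and Decision types, the runAgent function, and the scripted model are all hypothetical, and the loop is synchronous for brevity even though real model and tool calls are asynchronous.

```typescript
// A tool is a named function from string input to string output (simplified).
type Tool = (input: string) => string

// What the stand-in "model" decides at each step: call a tool or finish.
type Decision =
  | { kind: 'tool'; name: string; input: string }
  | { kind: 'final'; answer: string }

// The model sees the tool results gathered so far and returns a decision.
type Model = (observations: string[]) => Decision

// Hypothetical loop: reason -> pick tool -> execute -> evaluate -> repeat,
// bounded by maxSteps so it always terminates.
function runAgent(model: Model, tools: Record<string, Tool>, maxSteps: number) {
  const observations: string[] = []
  for (let step = 1; step <= maxSteps; step++) {
    const decision = model(observations) // reasoning and evaluation
    if (decision.kind === 'final') {
      return { answer: decision.answer, steps: step, finished: true }
    }
    const tool = tools[decision.name] // tool selection
    const result = tool ? tool(decision.input) : `Unknown tool: ${decision.name}`
    observations.push(result) // execution result is fed back to the model
  }
  // Step budget exhausted before the model produced a final answer.
  return { answer: '', steps: maxSteps, finished: false }
}

// Scripted stand-in for an LLM: search once, then answer.
const scriptedModel: Model = (observations) =>
  observations.length === 0
    ? { kind: 'tool', name: 'webSearch', input: 'vector databases' }
    : { kind: 'final', answer: `Based on: ${observations[0]}` }

const agentTools: Record<string, Tool> = { webSearch: (q) => `results for "${q}"` }
const agentResult = runAgent(scriptedModel, agentTools, 5)
console.log(agentResult)
```

The step cap plays the same role as maxSteps: if the model never produces a final answer, the loop still ends.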

Multi-Step Reasoning

Complex Tasks with Agent Loops

Example: Research Agent

User: "Create a comparison of the top 3 vector databases for our use case (e-commerce, 10M products, hybrid search)"

Agent steps:

1. webSearch("vector database comparison 2026") → overview article
2. readPage("pinecone.io/pricing") → Pinecone pricing and features
3. readPage("qdrant.tech/documentation") → Qdrant capabilities
4. readPage("weaviate.io/developers") → Weaviate hybrid search
5. saveNote({ title: "DB Comparison", content: ... }) → save note
6. Final answer: Structured comparison with recommendation

Conversation Memory

Long-Term Memory for Agents

useChat keeps the conversation history in client state automatically, but for long-term memory you need server-side persistence:

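import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

// loadConversation and saveConversation are placeholders for your own storage layer.
export async function POST(req: Request) {
  // Assumes the client sends only the new messages; if it sends the full
  // history (the useChat default), merging with stored history would duplicate it.
  const { messages, sessionId } = await req.json()

  // Load previous conversation
  const history = await loadConversation(sessionId)

  const result = streamText({
    model: openai('gpt-4.1'),
    messages: [...history, ...messages],
    onFinish: async ({ text }) => {
      // Persist the full exchange, including the assistant's reply
      await saveConversation(sessionId, [
        ...history,
        ...messages,
        { role: 'assistant', content: text },
      ])
    },
  })

  return result.toDataStreamResponse()
}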

Memory Strategies

Strategy	Description	Context Usage
Full history	Keep all messages	High: the context limit is reached quickly
Sliding window	Keep only the last N messages	Medium: older context is lost
Summary	Summarize older messages	Low: compression loses some information
Hybrid	Summary plus the last N messages	Low to medium: balances context and efficiency
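As a minimal sketch of the sliding-window and hybrid strategies, the functions below trim a message array before it is sent to the model. The ChatMessage type, slidingWindow, hybridMemory, and the summarize callback are hypothetical names introduced for illustration; in practice the summary would likely come from another LLM call.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Sliding window: keep only the last n messages.
function slidingWindow(history: ChatMessage[], n: number): ChatMessage[] {
  return n <= 0 ? [] : history.slice(-n)
}

// Hybrid: replace everything older than the last n messages with a single
// summary message. summarize is a placeholder for a real summarizer.
function hybridMemory(
  history: ChatMessage[],
  n: number,
  summarize: (older: ChatMessage[]) => string,
): ChatMessage[] {
  const recent = slidingWindow(history, n)
  const older = history.slice(0, history.length - recent.length)
  if (older.length === 0) return recent
  return [
    { role: 'system', content: `Summary of earlier conversation: ${summarize(older)}` },
    ...recent,
  ]
}

const pastMessages: ChatMessage[] = [
  { role: 'user', content: 'We sell shoes.' },
  { role: 'assistant', content: 'Noted.' },
  { role: 'user', content: 'We have 10M products.' },
  { role: 'assistant', content: 'Understood.' },
]

// Toy summarizer that only counts messages, to keep the example deterministic.
const compact = hybridMemory(pastMessages, 2, (older) => `${older.length} earlier messages`)
console.log(compact)
```

Here the two oldest messages collapse into one summary message while the last two are kept verbatim, which is the trade-off the hybrid row describes.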

Middleware

AI SDK Middleware

Middleware enables cross-cutting concerns like logging, caching, and guardrails:

Logging middleware: Log every request and response
Caching middleware: Answer identical prompts from cache
Rate limiting: Enforce token budget per user
Guardrails: Check output for PII, toxicity, or off-topic content
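The wrapping pattern behind these concerns can be sketched without any library. The code below is not the AI SDK's middleware interface; Generate, Middleware, compose, and the three wrappers are hypothetical names, and the model call is a synchronous stand-in where a real call would be asynchronous.

```typescript
// Stand-in for a model call, kept synchronous so the pattern is easy to follow.
type Generate = (prompt: string) => string
type Middleware = (next: Generate) => Generate

// Logging: record every prompt and response.
const withLogging = (log: string[]): Middleware => (next) => (prompt) => {
  const response = next(prompt)
  log.push(`prompt=${prompt} response=${response}`)
  return response
}

// Caching: answer identical prompts from an in-memory map.
const withCache = (cache: Map<string, string>): Middleware => (next) => (prompt) => {
  const hit = cache.get(prompt)
  if (hit !== undefined) return hit
  const response = next(prompt)
  cache.set(prompt, response)
  return response
}

// Guardrail: redact anything that looks like an email address in the output.
const withEmailRedaction: Middleware = (next) => (prompt) =>
  next(prompt).replace(/[\w.+-]+@[\w-]+\.[\w.-]+/g, '[redacted]')

// Apply middlewares so that the first one listed is the outermost wrapper.
function compose(base: Generate, ...middlewares: Middleware[]): Generate {
  return middlewares.reduceRight((next, mw) => mw(next), base)
}

let modelCalls = 0
const fakeModel: Generate = (prompt) => {
  modelCalls++
  return `Echo: ${prompt} (contact: jane@example.com)`
}

const requestLog: string[] = []
const generate = compose(fakeModel, withLogging(requestLog), withCache(new Map()), withEmailRedaction)
const firstReply = generate('hello')
const secondReply = generate('hello') // served from the cache
console.log(firstReply, modelCalls, requestLog.length)
```

Ordering matters in this sketch: logging sits outside the cache so cached answers are still logged, and redaction sits inside it so only redacted text is ever stored.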

Agent reality: Autonomous agents are powerful but unpredictable. Set maxSteps conservatively (5–10), implement abort conditions, and monitor token consumption. An endlessly running agent can cost hundreds of dollars in minutes.
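One way to express such abort conditions is a small check that runs after every step. This is a sketch under assumptions: StepUsage, Limits, and abortReason are hypothetical names, the token field names are modeled on typical usage reports, and wiring the result into a real run is left out.

```typescript
type StepUsage = { promptTokens: number; completionTokens: number }
type Limits = { maxSteps: number; maxTotalTokens: number }

// Returns a reason to stop, or null if the agent may take another step.
function abortReason(steps: StepUsage[], limits: Limits): string | null {
  const totalTokens = steps.reduce(
    (sum, s) => sum + s.promptTokens + s.completionTokens,
    0,
  )
  if (steps.length >= limits.maxSteps) return 'step limit reached'
  if (totalTokens >= limits.maxTotalTokens) return 'token budget exhausted'
  return null
}

const limits: Limits = { maxSteps: 5, maxTotalTokens: 10_000 }

// After one step: 3,500 tokens used, so the agent may continue.
const afterOne = abortReason([{ promptTokens: 3000, completionTokens: 500 }], limits)

// After two steps: 10,300 tokens used, so the budget is exceeded.
const afterTwo = abortReason(
  [
    { promptTokens: 3000, completionTokens: 500 },
    { promptTokens: 6000, completionTokens: 800 },
  ],
  limits,
)
console.log(afterOne, afterTwo)
```

Prompt tokens tend to grow with each step because earlier tool results are resent, so a token budget can trip well before the step limit does.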