Lesson 5 of 6 · 10 min read

Multi-Modal & Agents

The Vercel AI SDK handles more than text: it also supports image input, audio input, and multi-step agent loops. This lesson shows how to build multimodal applications and autonomous AI agents.

Image & Audio Input

Sending Images to LLMs

Vision-capable models (GPT-4.1, Claude Sonnet 4, Gemini 2.5) accept images alongside text in the same user message:

import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const { text } = await generateText({
  model: openai('gpt-4.1'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image?' },
        { type: 'image', image: new URL('https://example.com/photo.jpg') },
      ],
    },
  ],
})

Use Cases for Vision

  • Product recognition: Photo → product name, category, price
  • Document analysis: Photograph invoice → extract structured data
  • UI review: Screenshot → accessibility and design feedback
  • Chart analysis: Diagram → data interpretation and summary
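Most of these use cases send one prompt together with one or more images. As a minimal sketch, the message shape from the image example above can be wrapped in a small helper. `buildVisionMessage`, the local type names, and the invoice URL are hypothetical and only serve as illustration; they are not part of the SDK.

```typescript
// Local types that mirror the message shape used in the image example above.
type TextPart = { type: 'text'; text: string }
type ImagePart = { type: 'image'; image: URL | Uint8Array | string }
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> }

// Hypothetical helper: one text prompt followed by any number of images.
function buildVisionMessage(
  prompt: string,
  images: Array<URL | Uint8Array | string>,
): UserMessage {
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      ...images.map((image): ImagePart => ({ type: 'image', image })),
    ],
  }
}

// Document analysis use case: the URL is a placeholder.
const invoiceMessage = buildVisionMessage(
  'Extract the invoice number, date, and total as JSON.',
  [new URL('https://example.com/invoice.jpg')],
)
```

The resulting object could be passed as an entry in the `messages` array of a `generateText` call like the one shown earlier.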

Audio Input

With models that support audio (e.g., Gemini 2.5):

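import { readFile } from 'node:fs/promises'
import { generateText } from 'ai'
import { google } from '@ai-sdk/google'

// Hypothetical local file; any Buffer or Uint8Array with audio data works here.
const audioBuffer = await readFile('./recording.mp3')

const { text } = await generateText({
  model: google('gemini-2.5-pro'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe and summarize this recording.' },
        { type: 'file', data: audioBuffer, mimeType: 'audio/mp3' },
      ],
    },
  ],
})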

Agent Loops

What Are AI Agents?

An agent is an LLM that autonomously makes decisions, calls tools, and iterates until a task is complete. In the Vercel AI SDK, you enable this loop by passing tools and setting maxSteps, which caps how many sequential model calls a single request may make:

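import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// messages and the three tool definitions are assumed to be defined elsewhere in the app.
const result = streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  system: 'You are a research agent. Use the available tools to thoroughly research questions.',
  messages,
  tools: {
    webSearch: webSearchTool,
    readPage: readPageTool,
    saveNote: saveNoteTool,
  },
  maxSteps: 10, // upper bound on model calls for this request
})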

Agent Flow

  1. Reasoning: LLM analyzes the task
  2. Tool selection: LLM chooses the appropriate tool
  3. Execution: Tool is executed, result returned to LLM
  4. Evaluation: LLM checks if the task is complete
  5. Iteration: If not complete → back to step 1
  6. Response: Final answer to the user
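The six steps above can be sketched as a plain loop. This is not how the SDK implements maxSteps internally; it is a simplified illustration with hypothetical types (`ModelTurn`, `Model`, `Tools`) and a scripted stand-in for the LLM so the control flow is easy to follow.

```typescript
// Hypothetical types for illustration; these are not the AI SDK's own types.
type ToolCall = { tool: string; input: string }
type ModelTurn = { toolCall: ToolCall } | { answer: string }
type Model = (transcript: string[]) => Promise<ModelTurn>
type Tools = Record<string, (input: string) => Promise<string>>

async function runAgentLoop(
  model: Model,
  tools: Tools,
  task: string,
  maxSteps: number,
): Promise<{ answer: string | null; steps: number }> {
  const transcript: string[] = [`task: ${task}`]
  for (let step = 1; step <= maxSteps; step++) {
    // Steps 1, 2, and 4: the model reasons, then either picks a tool or answers.
    const turn = await model(transcript)
    if ('answer' in turn) {
      // Step 6: the model considers the task complete.
      return { answer: turn.answer, steps: step }
    }
    // Step 3: run the chosen tool and feed the result back to the model.
    const { tool, input } = turn.toolCall
    const result = tool in tools ? await tools[tool](input) : `error: unknown tool ${tool}`
    transcript.push(`${tool}(${input}) -> ${result}`)
    // Step 5: the next iteration sees the extended transcript.
  }
  // Step budget exhausted without a final answer.
  return { answer: null, steps: maxSteps }
}

// Scripted stand-in for an LLM: search once, then answer.
const scriptedModel: Model = async (transcript) =>
  transcript.length === 1
    ? { toolCall: { tool: 'webSearch', input: 'vector databases' } }
    : { answer: `Based on ${transcript.length - 1} tool result(s).` }

const demoTools: Tools = {
  webSearch: async (query) => `3 results for "${query}"`,
}
```

Calling `runAgentLoop(scriptedModel, demoTools, 'compare vector databases', 5)` finishes in two steps: one tool call, then the final answer. A model that never answers stops once the step limit is reached, which is the role maxSteps plays.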

Multi-Step Reasoning

Complex Tasks with Agent Loops

Example: Research Agent

User: "Create a comparison of the top 3 vector databases for our use case (e-commerce, 10M products, hybrid search)"

Agent steps:

  1. webSearch("vector database comparison 2026") → overview article
  2. readPage("pinecone.io/pricing") → Pinecone pricing and features
  3. readPage("qdrant.tech/documentation") → Qdrant capabilities
  4. readPage("weaviate.io/developers") → Weaviate hybrid search
  5. saveNote({ title: "DB Comparison", content: ... }) → save note
  6. Final answer: Structured comparison with recommendation

Conversation Memory

Long-Term Memory for Agents

useChat keeps the conversation history in client state for the current session. For memory that survives reloads and new sessions, persist the messages on the server:

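import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
// loadConversation and saveConversation are app-specific persistence helpers, not part of the SDK.

export async function POST(req: Request) {
  const { messages, sessionId } = await req.json()

  // Load previous conversation.
  // The merge below assumes the client sends only new messages; if the client
  // already sends the full history, skip the merge to avoid duplicates.
  const history = await loadConversation(sessionId)

  const result = streamText({
    model: openai('gpt-4.1'),
    messages: [...history, ...messages],
    onFinish: async ({ text }) => {
      // Save conversation, including the assistant reply that was just generated
      await saveConversation(sessionId, [...history, ...messages, { role: 'assistant', content: text }])
    },
  })

  return result.toDataStreamResponse()
}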

Memory Strategies

Strategy | Description | Context Usage
Full history | Keep all messages | High: context limit reached quickly
Sliding window | Only the last N messages | Medium: older context is lost
Summary | Summarize older messages | Low: compression with information loss
Hybrid | Summary + last N messages | Optimal: balance of context and efficiency
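The sliding-window and hybrid strategies from the table can be sketched as two small functions. `ChatMessage`, `slidingWindow`, `hybridMemory`, and the `summarize` callback are hypothetical names; in a real application the summary would probably come from another (asynchronous) LLM call, and it is synchronous here only to keep the sketch short.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Sliding window: keep only the last `n` messages.
// The n > 0 guard matters because slice(-0) would return the whole array.
function slidingWindow(history: ChatMessage[], n: number): ChatMessage[] {
  return n > 0 ? history.slice(-n) : []
}

// Hybrid: one summary message for the older turns, then the last `n` messages verbatim.
function hybridMemory(
  history: ChatMessage[],
  n: number,
  summarize: (older: ChatMessage[]) => string,
): ChatMessage[] {
  const recent = slidingWindow(history, n)
  const older = history.slice(0, history.length - recent.length)
  if (older.length === 0) return recent
  const summary: ChatMessage = {
    role: 'system',
    content: `Summary of earlier conversation: ${summarize(older)}`,
  }
  return [summary, ...recent]
}

// Five alternating messages as sample data.
const sampleHistory: ChatMessage[] = ['a', 'b', 'c', 'd', 'e'].map(
  (content, i): ChatMessage => ({ role: i % 2 === 0 ? 'user' : 'assistant', content }),
)
const compacted = hybridMemory(sampleHistory, 2, (older) => `${older.length} messages`)
```

With five messages and a window of two, `compacted` holds three entries: the summary of the first three messages followed by the last two messages unchanged.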

Middleware

AI SDK Middleware

Middleware enables cross-cutting concerns like logging, caching, and guardrails:

  • Logging middleware: Log every request and response
  • Caching middleware: Answer identical prompts from cache
  • Rate limiting: Enforce token budget per user
  • Guardrails: Check output for PII, toxicity, or off-topic content
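To illustrate the idea behind the logging and caching bullets, here is a sketch that wraps a text-generation function with plain higher-order functions. It deliberately does not use the SDK's own middleware API; `Generate`, `Middleware`, `withLogging`, `withCache`, and `applyMiddleware` are hypothetical names, and the model is a fake that echoes its prompt.

```typescript
// Hypothetical minimal model interface; the AI SDK's real middleware API differs.
type Generate = (prompt: string) => Promise<string>
type Middleware = (next: Generate) => Generate

// Logging middleware: record every request and response.
const withLogging = (log: string[]): Middleware => (next) => async (prompt) => {
  log.push(`request: ${prompt}`)
  const text = await next(prompt)
  log.push(`response: ${text}`)
  return text
}

// Caching middleware: answer identical prompts from an in-memory cache.
const withCache = (cache = new Map<string, string>()): Middleware => (next) => async (prompt) => {
  const hit = cache.get(prompt)
  if (hit !== undefined) return hit
  const text = await next(prompt)
  cache.set(prompt, text)
  return text
}

// Compose so that the first middleware listed becomes the outermost wrapper.
const applyMiddleware = (model: Generate, ...middleware: Middleware[]): Generate =>
  middleware.reduceRight<Generate>((next, mw) => mw(next), model)

// Fake model that counts how often it is actually called.
let modelCalls = 0
const fakeModel: Generate = async (prompt) => {
  modelCalls++
  return `echo: ${prompt}`
}

const requestLog: string[] = []
const wrappedModel = applyMiddleware(fakeModel, withLogging(requestLog), withCache())
```

Because logging wraps caching, every call is logged, but a repeated prompt reaches the underlying model only once. Rate limiting and guardrails would follow the same pattern: check before calling `next`, or inspect the text it returns.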

Agent reality: Autonomous agents are powerful but unpredictable. Set maxSteps conservatively (5–10), implement abort conditions, and monitor token consumption. Every step resends the growing context, so an agent stuck in a loop can run up a large bill quickly.
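One possible abort condition is a token budget tied to an AbortController. The sketch below assumes that the agent call accepts an abort signal and reports token usage after each step; `createTokenBudget` and the `StepUsage` shape are hypothetical and would need to be adapted to what your SDK version actually reports.

```typescript
// Hypothetical usage shape; assumed to resemble what a per-step callback reports.
type StepUsage = { totalTokens: number }

function createTokenBudget(maxTokens: number) {
  const controller = new AbortController()
  let used = 0
  return {
    // Pass this signal to the agent call so that it stops when the budget is spent.
    signal: controller.signal,
    // Call once per agent step; aborts as soon as the budget is exceeded.
    record(usage: StepUsage): number {
      used += usage.totalTokens
      if (used > maxTokens && !controller.signal.aborted) controller.abort()
      return used
    },
  }
}

// Two steps of 600 tokens each against a budget of 1000.
const budget = createTokenBudget(1000)
budget.record({ totalTokens: 600 })
const abortedAfterFirstStep = budget.signal.aborted
const totalUsed = budget.record({ totalTokens: 600 })
```

After the first step the signal is still open; the second step pushes usage past the budget and triggers the abort. Combined with a conservative maxSteps, this gives two independent limits: one on the number of steps and one on cost.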