The Vercel AI SDK supports not only text but also images, audio, and multi-step agent loops. This lesson shows how to build multimodal applications and autonomous AI agents.
Vision-capable models (GPT-4.1, Claude Sonnet 4, Gemini 2.5) can analyze images that are passed alongside text in the same message:
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const { text } = await generateText({
  model: openai('gpt-4.1'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image?' },
        { type: 'image', image: new URL('https://example.com/photo.jpg') },
      ],
    },
  ],
})
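Models that support audio input (e.g., Gemini 2.5) accept a recording as a file part:
import { readFile } from 'node:fs/promises'
import { generateText } from 'ai'
import { google } from '@ai-sdk/google'

// Assumption: the recording is a local MP3 file; the path is a placeholder.
const audioBuffer = await readFile('./recording.mp3')

const { text } = await generateText({
  model: google('gemini-2.5-pro'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe and summarize this recording.' },
        { type: 'file', data: audioBuffer, mimeType: 'audio/mp3' },
      ],
    },
  ],
})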
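An agent is an LLM that autonomously makes decisions, calls tools, and iterates until a task is complete. In the Vercel AI SDK, you enable this loop by setting maxSteps, which caps how many model calls (steps) a single request may make:
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// messages and the three tools are assumed to be defined elsewhere.
const result = streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  system: 'You are a research agent. Use the available tools to thoroughly research questions.',
  messages,
  tools: {
    webSearch: webSearchTool,
    readPage: readPageTool,
    saveNote: saveNoteTool,
  },
  maxSteps: 10,
})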
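To make the role of maxSteps concrete, here is a minimal sketch of the kind of loop it bounds. It is not the SDK's implementation: the types, the `runAgentLoop` function, and the scripted model are all hypothetical, and the model call is kept synchronous for brevity even though real model calls are asynchronous.

```typescript
// Hypothetical types for illustration; these are not AI SDK types.
type ToolCall = { tool: string; input: string }
type ModelTurn = { text: string; toolCalls: ToolCall[] }
type Tools = Record<string, (input: string) => string>
type ModelFn = (transcript: string[]) => ModelTurn

function runAgentLoop(model: ModelFn, tools: Tools, prompt: string, maxSteps: number) {
  const transcript: string[] = [`user: ${prompt}`]
  let steps = 0
  while (steps < maxSteps) {
    steps++
    const turn = model(transcript)
    if (turn.toolCalls.length === 0) {
      // No tool calls: the model considers the task finished.
      return { text: turn.text, steps, finished: true }
    }
    // Run every requested tool and feed the results back to the model.
    for (const call of turn.toolCalls) {
      const tool = tools[call.tool]
      const output = tool ? tool(call.input) : `unknown tool: ${call.tool}`
      transcript.push(`tool ${call.tool}(${call.input}): ${output}`)
    }
  }
  // Step limit reached before the model produced a final answer.
  return { text: '', steps, finished: false }
}

// A scripted stand-in for a real model: search once, then answer.
const scriptedModel: ModelFn = (transcript) =>
  transcript.some((line) => line.startsWith('tool '))
    ? { text: 'Summary based on search results.', toolCalls: [] }
    : { text: '', toolCalls: [{ tool: 'webSearch', input: 'vector databases' }] }

const demoTools: Tools = { webSearch: (query) => `3 results for "${query}"` }
const agentResult = runAgentLoop(scriptedModel, demoTools, 'Compare vector databases', 10)
console.log(agentResult)
```

The scripted model finishes in two steps. A model that keeps requesting tools would instead stop at the step limit with `finished: false`, which is the case the limit exists to catch.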
Example: Research Agent
User: "Create a comparison of the top 3 vector databases for our use case (e-commerce, 10M products, hybrid search)"
Agent steps:
1. webSearch("vector database comparison 2026") → overview article
2. readPage("pinecone.io/pricing") → Pinecone pricing and features
3. readPage("qdrant.tech/documentation") → Qdrant capabilities
4. readPage("weaviate.io/developers") → Weaviate hybrid search
5. saveNote({ title: "DB Comparison", content: ... }) → save note

useChat manages conversation history automatically, but for long-term memory you need persistence:
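import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

// loadConversation and saveConversation are application-defined persistence helpers.
// Assumes the client sends only the new messages for this turn; if it sends the
// full list, merging with stored history would duplicate entries.
export async function POST(req: Request) {
  const { messages, sessionId } = await req.json()

  // Load previous conversation
  const history = await loadConversation(sessionId)

  const result = streamText({
    model: openai('gpt-4.1'),
    messages: [...history, ...messages],
    onFinish: async ({ text }) => {
      // Save conversation, including the assistant's reply
      await saveConversation(sessionId, [...history, ...messages, { role: 'assistant', content: text }])
    },
  })

  return result.toDataStreamResponse()
}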
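The two persistence helpers used in the route are not part of the SDK. As a minimal sketch, assuming an in-memory Map stands in for a real database, they could look like this. The `StoredMessage` type and the `conversations` store are hypothetical names introduced for illustration.

```typescript
// Hypothetical message shape; a real application may store richer message objects.
type StoredMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// In-memory store for illustration only; its contents are lost on restart.
const conversations = new Map<string, StoredMessage[]>()

async function loadConversation(sessionId: string): Promise<StoredMessage[]> {
  // Return a copy so callers cannot mutate the stored history by accident.
  return [...(conversations.get(sessionId) ?? [])]
}

async function saveConversation(sessionId: string, messages: StoredMessage[]): Promise<void> {
  // Overwrite the session with the full, updated message list.
  conversations.set(sessionId, [...messages])
}
```

An unknown session yields an empty history, so the route handler works unchanged for the first message of a new conversation.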
| Strategy | Description | Context Usage |
|---|---|---|
| Full history | Keep all messages | High — context limit reached quickly |
| Sliding window | Only the last N messages | Medium — older context is lost |
| Summary | Summarize older messages | Low — compression with information loss |
| Hybrid | Summary + last N messages | Optimal — balance of context and efficiency |
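The hybrid strategy from the table can be sketched as a small pure function. This is an illustration under assumptions: `buildHybridContext`, the `ChatMessage` type, and the `Summarizer` signature are hypothetical, and the summarizer here is a trivial placeholder where a real application would probably make a separate model call.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Hypothetical summarizer signature; in practice this is likely another model call.
type Summarizer = (older: ChatMessage[]) => string

function buildHybridContext(
  history: ChatMessage[],
  keepLast: number,
  summarize: Summarizer,
): ChatMessage[] {
  // Short conversations fit as they are; nothing to compress.
  if (history.length <= keepLast) return history

  const cut = history.length - keepLast
  const older = history.slice(0, cut)
  const recent = history.slice(cut)

  // Replace the older messages with a single summary message.
  const summary: ChatMessage = {
    role: 'system',
    content: `Summary of earlier conversation: ${summarize(older)}`,
  }
  return [summary, ...recent]
}

// Placeholder summarizer that only counts what it dropped.
const naiveSummarize: Summarizer = (older) => `${older.length} earlier messages omitted`

const longHistory: ChatMessage[] = [
  { role: 'user', content: 'We sell about 10M products.' },
  { role: 'assistant', content: 'Noted.' },
  { role: 'user', content: 'We need hybrid search.' },
  { role: 'assistant', content: 'Understood.' },
  { role: 'user', content: 'Which database fits?' },
]
const hybridContext = buildHybridContext(longHistory, 2, naiveSummarize)
console.log(hybridContext)
```

With `keepLast` set to 2, the five-message history shrinks to three messages: one summary plus the last two turns verbatim.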
Middleware handles cross-cutting concerns such as logging, caching, and guardrails by wrapping calls to the model.
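The idea behind middleware can be shown without the SDK. The sketch below uses a deliberately simplified, hypothetical model signature (a synchronous function from prompt to string) and hypothetical helper names. The SDK ships its own language model middleware API with a richer interface, so treat this as a picture of the pattern rather than of that API.

```typescript
// Hypothetical, simplified model signature; the SDK's real model interface is richer.
type Generate = (prompt: string) => string
type Middleware = (next: Generate) => Generate

// Logging: record every prompt and output that passes through.
const logCalls = (log: string[]): Middleware => (next) => (prompt) => {
  log.push(`prompt: ${prompt}`)
  const output = next(prompt)
  log.push(`output: ${output}`)
  return output
}

// Caching: answer repeated prompts without calling the model again.
const cacheResponses = (cache: Map<string, string>): Middleware => (next) => (prompt) => {
  const hit = cache.get(prompt)
  if (hit !== undefined) return hit
  const output = next(prompt)
  cache.set(prompt, output)
  return output
}

// Guardrail: refuse prompts that contain a blocked term.
const blockTerms = (terms: string[]): Middleware => (next) => (prompt) => {
  const lower = prompt.toLowerCase()
  if (terms.some((term) => lower.includes(term.toLowerCase()))) {
    return 'Request blocked by guardrail.'
  }
  return next(prompt)
}

// Compose so that the first middleware in the list is the outermost wrapper.
function applyMiddleware(base: Generate, middlewares: Middleware[]): Generate {
  return middlewares.reduceRight((next, middleware) => middleware(next), base)
}

let modelCalls = 0
const fakeModel: Generate = (prompt) => {
  modelCalls++
  return `echo: ${prompt}`
}

const callLog: string[] = []
const guarded = applyMiddleware(fakeModel, [
  logCalls(callLog),
  blockTerms(['password']),
  cacheResponses(new Map()),
])
```

Order matters: logging sits outermost so that blocked and cached requests are still recorded, while the cache sits closest to the model so that only prompts that passed the guardrail are stored.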
Agent reality: Autonomous agents are powerful but unpredictable. Set maxSteps conservatively (5–10), implement abort conditions, and monitor token consumption. An endlessly running agent can cost hundreds of dollars in minutes.
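One way to combine a step limit with token monitoring is a small budget guard that is consulted after every step. This is a sketch under assumptions: `createBudgetGuard`, `StepUsage`, and `Budget` are hypothetical names, and the token counts would have to come from whatever usage data your model calls report.

```typescript
type StepUsage = { promptTokens: number; completionTokens: number }
type Budget = { maxSteps: number; maxTotalTokens: number }

function createBudgetGuard(budget: Budget) {
  let steps = 0
  let totalTokens = 0
  return {
    // Record one finished step and report whether the agent may continue.
    recordStep(usage: StepUsage): { shouldContinue: boolean; reason?: string } {
      steps += 1
      totalTokens += usage.promptTokens + usage.completionTokens
      if (totalTokens >= budget.maxTotalTokens) {
        return { shouldContinue: false, reason: 'token budget exhausted' }
      }
      if (steps >= budget.maxSteps) {
        return { shouldContinue: false, reason: 'step limit reached' }
      }
      return { shouldContinue: true }
    },
    // Expose running totals for logging or monitoring.
    totals: () => ({ steps, totalTokens }),
  }
}

const guard = createBudgetGuard({ maxSteps: 5, maxTotalTokens: 1000 })
const afterFirstStep = guard.recordStep({ promptTokens: 400, completionTokens: 100 })
const afterSecondStep = guard.recordStep({ promptTokens: 400, completionTokens: 200 })
console.log(afterFirstStep, afterSecondStep, guard.totals())
```

In a real application, a `shouldContinue: false` result would typically trigger cancellation of the running request, for example through an abort signal, and the totals would go to your monitoring system.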