Production Deployment

Getting an AI application running locally is the easy part. Running it reliably, cost-efficiently, and at scale in production requires specific strategies for Vercel and the AI SDK.

Vercel Deployment

Deployment Configuration

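AI applications on Vercel often need non-default settings. For example, Server Actions accept request bodies of at most 1 MB by default, so image uploads need a higher limit: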

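// next.config.ts
import type { NextConfig } from 'next'

const nextConfig: NextConfig = {
  experimental: {
    serverActions: {
      // Raise the Server Action body limit to allow image uploads
      bodySizeLimit: '4mb',
    },
  },
}

export default nextConfig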

Environment Variables

Variable	Description	Required
`OPENAI_API_KEY`	OpenAI API key	Yes (with OpenAI)
`ANTHROPIC_API_KEY`	Anthropic API key	Yes (with Anthropic)
`AI_PROVIDER`	Default provider	Optional
`AI_MAX_TOKENS`	Token limit per request	Recommended
`AI_RATE_LIMIT`	Requests per minute	Recommended
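
One way to consume these variables is to validate them once at startup instead of reading them ad hoc in each route. The following is a minimal sketch under stated assumptions: `loadAiConfig`, the `AiConfig` shape, and the defaults (provider `openai`, 1024 tokens, 10 requests per minute) are illustrative and not part of Vercel or the AI SDK. The environment is passed in as an argument so the function is easy to test; in an application you would pass `process.env`.

```typescript
// Hypothetical config loader for the variables listed in the table.
// Defaults below are assumptions chosen for illustration.
type Provider = 'openai' | 'anthropic'

interface AiConfig {
  provider: Provider
  apiKey: string
  maxTokens: number
  rateLimitPerMinute: number
}

type Env = Record<string, string | undefined>

function isProvider(value: string): value is Provider {
  return value === 'openai' || value === 'anthropic'
}

// Parses an optional positive integer, falling back to a default when unset.
function parsePositiveInt(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === '') return fallback
  const value = Number(raw)
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`)
  }
  return value
}

function loadAiConfig(env: Env): AiConfig {
  const provider = env['AI_PROVIDER'] ?? 'openai'
  if (!isProvider(provider)) {
    throw new Error(`Unsupported AI_PROVIDER "${provider}"`)
  }
  // Only the key for the selected provider is required.
  const keyName = provider === 'openai' ? 'OPENAI_API_KEY' : 'ANTHROPIC_API_KEY'
  const apiKey = env[keyName]
  if (!apiKey) {
    throw new Error(`${keyName} is required when AI_PROVIDER is "${provider}"`)
  }
  return {
    provider,
    apiKey,
    maxTokens: parsePositiveInt(env['AI_MAX_TOKENS'], 1024, 'AI_MAX_TOKENS'),
    rateLimitPerMinute: parsePositiveInt(env['AI_RATE_LIMIT'], 10, 'AI_RATE_LIMIT'),
  }
}
```

Failing fast on a missing key surfaces misconfiguration at deploy time rather than on the first user request.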

Vercel AI Gateway

Vercel offers an AI Gateway that acts as a proxy between your application and AI providers:

Caching: Answer identical requests from cache
Rate limiting: Enforce per-user limits
Fallback: Automatically switch to an alternative provider on failure
Analytics: Track token consumption, latency, and error rates
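
To make the fallback idea concrete, here is a minimal sketch of the same behavior in application code. It is not the AI Gateway API: `withFallback` and the `Attempt` type are hypothetical names, and a gateway performs this step at the proxy layer so your code does not have to.

```typescript
// Generic fallback helper: tries each attempt in order and returns the first success.
// Hypothetical illustration of the concept, not a gateway or AI SDK API.
type Attempt<T> = { name: string; run: () => Promise<T> }

async function withFallback<T>(
  attempts: Attempt<T>[],
): Promise<{ provider: string; value: T }> {
  const errors: string[] = []
  for (const attempt of attempts) {
    try {
      const value = await attempt.run()
      return { provider: attempt.name, value }
    } catch (err) {
      // Remember why this provider failed, then move on to the next one.
      errors.push(`${attempt.name}: ${err instanceof Error ? err.message : String(err)}`)
    }
  }
  throw new Error(`All providers failed (${errors.join('; ')})`)
}
```

Each `run` function would wrap one provider call, so the caller only sees which provider answered and the result.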

Edge Runtime

When to Use Edge Runtime?

The Vercel Edge Runtime executes code at globally distributed edge locations, which keeps latency low for users far from your primary region:

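import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Run this route handler on the Edge Runtime
export const runtime = 'edge'

export async function POST(req: Request) {
  const { messages } = await req.json()

  const result = streamText({
    model: openai('gpt-4.1'),
    messages,
  })

  // Stream the response back to the client as it is generated
  return result.toDataStreamResponse()
}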

Edge Runtime is suitable for:

Simple chat endpoints without database access
Streaming responses (lower time-to-first-byte)
Global applications with users in different regions

Prefer Node.js Runtime for:

Database access (Supabase, Prisma)
File system access
Native Node.js modules
Long-running operations (> 30 seconds)

Rate Limiting

Implementation with Upstash

import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 m'), // 10 requests/minute
})

export async function POST(req: Request) {
  // getUserId is an app-specific helper, e.g. one that reads the session or API key
  const userId = getUserId(req)
  const { success } = await ratelimit.limit(userId)

  if (!success) {
    return new Response('Rate limit exceeded', { status: 429 })
  }

  // ... AI SDK logic
}

Multi-Level Limits

Level	Limit	Purpose
Free tier	20 requests/day	Testing and onboarding
Pro tier	100 requests/hour	Normal usage
Enterprise	Custom	By agreement
Per-model	Variable	Limit expensive models more
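
The tiers in the table can be expressed as data and checked by a single function. The sketch below uses an in-memory fixed-window counter so that it stays self-contained; in production the counters would live in a shared store such as Redis. The names `TIER_LIMITS` and `checkLimit` are illustrative assumptions, the Enterprise value is a placeholder for a negotiated limit, and per-model limits are left out for brevity.

```typescript
type Tier = 'free' | 'pro' | 'enterprise'

interface LimitRule {
  maxRequests: number
  windowMs: number
}

const HOUR = 60 * 60 * 1000
const DAY = 24 * HOUR

// Limits taken from the table; the enterprise entry is a placeholder.
const TIER_LIMITS: Record<Tier, LimitRule> = {
  free: { maxRequests: 20, windowMs: DAY },
  pro: { maxRequests: 100, windowMs: HOUR },
  enterprise: { maxRequests: 10000, windowMs: HOUR },
}

// Per-user counters, keyed by "tier:userId". In-memory only (assumption).
const counters = new Map<string, { windowStart: number; count: number }>()

function checkLimit(
  userId: string,
  tier: Tier,
  now: number = Date.now(),
): { success: boolean; remaining: number } {
  const rule = TIER_LIMITS[tier]
  const key = `${tier}:${userId}`
  const entry = counters.get(key)

  // Start a fresh window if none exists or the previous one has elapsed.
  if (!entry || now - entry.windowStart >= rule.windowMs) {
    counters.set(key, { windowStart: now, count: 1 })
    return { success: true, remaining: rule.maxRequests - 1 }
  }
  if (entry.count >= rule.maxRequests) {
    return { success: false, remaining: 0 }
  }
  entry.count += 1
  return { success: true, remaining: rule.maxRequests - entry.count }
}
```

A fixed window is simpler than the sliding window used with Upstash above but allows short bursts at window boundaries, which is usually acceptable for daily or hourly quotas.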

Cost Tracking

Monitor Token Consumption

Every AI call consumes tokens, and tokens cost money. Log the usage of each request:

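import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const result = await generateText({
  model: openai('gpt-4.1'),
  prompt: '...',
})

// Log token consumption
// calculateCost is an app-specific helper that maps usage to a cost estimate
console.log({
  inputTokens: result.usage.promptTokens,
  outputTokens: result.usage.completionTokens,
  totalTokens: result.usage.totalTokens,
  estimatedCost: calculateCost(result.usage),
})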

Cost per 1M Tokens (2026)

Model	Input	Output
GPT-4.1	$2.00	$8.00
GPT-4.1 mini	$0.40	$1.60
Claude Sonnet 4	$3.00	$15.00
Gemini 2.5 Flash	$0.15	$0.60
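
The prices above translate into a per-request estimate with simple arithmetic: tokens times price per million, divided by one million. The following sketch shows one possible cost helper. The name `estimateCost`, the model ID keys, and the extra `modelId` argument are assumptions made for illustration; the prices are copied from the table and should be kept in sync with your providers' current price lists.

```typescript
interface Usage {
  promptTokens: number
  completionTokens: number
}

// USD per 1M tokens, copied from the table above. Model ID keys are assumptions.
const PRICES_PER_MILLION: Record<string, { input: number; output: number }> = {
  'gpt-4.1': { input: 2.0, output: 8.0 },
  'gpt-4.1-mini': { input: 0.4, output: 1.6 },
  'claude-sonnet-4': { input: 3.0, output: 15.0 },
  'gemini-2.5-flash': { input: 0.15, output: 0.6 },
}

// Returns the estimated cost of one request in USD.
function estimateCost(usage: Usage, modelId: string): number {
  const price = PRICES_PER_MILLION[modelId]
  if (!price) {
    throw new Error(`No price configured for model "${modelId}"`)
  }
  const inputCost = usage.promptTokens * price.input
  const outputCost = usage.completionTokens * price.output
  return (inputCost + outputCost) / 1000000
}

// 1M input tokens ($2.00) plus 500k output tokens ($4.00) on GPT-4.1
const example = estimateCost({ promptTokens: 1000000, completionTokens: 500000 }, 'gpt-4.1')
// → 6
```

Throwing on an unknown model avoids silently logging a cost of zero when a new model is rolled out without a price entry.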

Cost Optimization

Model routing: Send simple questions to a cheap model and complex ones to an expensive model
Prompt caching: Anthropic and OpenAI offer prompt caching with discounts of roughly 50–90% on cached input tokens
Output limits: Set maxTokens to avoid endless responses
Caching layer: Answer frequent questions from cache
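
Two of these strategies, model routing and a caching layer, can be sketched in a few lines. Everything here is an illustrative assumption rather than an AI SDK feature: the routing heuristic (prompt length and a keyword list), the names `pickModel` and `createAnswerCache`, and the in-memory store. A production router would more likely use a classifier or request metadata, and a production cache would use a shared store.

```typescript
const CHEAP_MODEL = 'gpt-4.1-mini'
const STRONG_MODEL = 'gpt-4.1'

// Hypothetical signals that a prompt needs the stronger model.
const COMPLEX_HINTS = ['analyze', 'compare', 'refactor', 'step by step']

function pickModel(prompt: string): string {
  const text = prompt.toLowerCase()
  const looksComplex =
    text.length > 500 || COMPLEX_HINTS.some((hint) => text.includes(hint))
  return looksComplex ? STRONG_MODEL : CHEAP_MODEL
}

// Minimal in-memory answer cache with a time-to-live.
function createAnswerCache(ttlMs: number) {
  const entries = new Map<string, { answer: string; expiresAt: number }>()

  // Normalize case and whitespace so trivially different prompts share an entry.
  const normalize = (prompt: string): string =>
    prompt.trim().toLowerCase().replace(/\s+/g, ' ')

  return {
    get(prompt: string, now: number = Date.now()): string | undefined {
      const key = normalize(prompt)
      const entry = entries.get(key)
      if (!entry) return undefined
      if (now >= entry.expiresAt) {
        entries.delete(key) // expired: evict and report a miss
        return undefined
      }
      return entry.answer
    },
    set(prompt: string, answer: string, now: number = Date.now()): void {
      entries.set(normalize(prompt), { answer, expiresAt: now + ttlMs })
    },
  }
}
```

In a route handler you would check the cache first, call the model chosen by `pickModel` only on a miss, and store the answer afterwards.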

A/B Testing & Monitoring

A/B Testing AI Features

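import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'

// abTest is an app-specific helper that returns a stable variant per user
const model = abTest('ai-model-test', userId) === 'variant_a'
  ? openai('gpt-4.1-mini')
  : anthropic('claude-sonnet-4-20250514')

const result = streamText({ model, messages })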

What to A/B test:

Different models (GPT vs. Claude vs. Gemini)
System prompts (short vs. detailed)
Temperature settings
RAG strategies (top-3 vs. top-5 chunks)
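
An A/B helper needs to give each user the same variant on every request, otherwise quality and cost measurements get mixed. One common approach is to hash the experiment name and user ID into a bucket. The sketch below is a hypothetical implementation of an `abTest`-style helper, not a Vercel or AI SDK API; it uses an FNV-1a-style hash over UTF-16 code units followed by the MurmurHash3 32-bit finalizer for better bit mixing, and the `shareA` parameter is an assumption.

```typescript
// Deterministic 32-bit string hash: FNV-1a-style loop plus MurmurHash3 finalizer.
function hash32(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  // Finalizer: spreads differences in trailing characters across all bits.
  hash ^= hash >>> 16
  hash = Math.imul(hash, 0x85ebca6b)
  hash ^= hash >>> 13
  hash = Math.imul(hash, 0xc2b2ae35)
  hash ^= hash >>> 16
  return hash >>> 0 // convert to unsigned
}

type Variant = 'variant_a' | 'variant_b'

// shareA is the fraction of users (0 to 1) assigned to variant_a.
function abTest(experiment: string, userId: string, shareA: number = 0.5): Variant {
  // Including the experiment name keeps assignments independent across experiments.
  const bucket = hash32(`${experiment}:${userId}`) / 0x100000000 // in [0, 1)
  return bucket < shareA ? 'variant_a' : 'variant_b'
}
```

Because the assignment depends only on the inputs, no database lookup is needed, and the same function can gate a gradual rollout by raising `shareA` over time.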

Monitoring Dashboard

Track these metrics in production:

Latency: Time-to-first-token, total response time
Error rate: Provider errors, rate limits, timeouts
Token consumption: Per user, per feature, per day
Cost: Daily, weekly, and monthly AI costs
Quality: User feedback (thumbs up/down), retry rate
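
Most of these metrics can be derived from one record per request. The following sketch aggregates such records into a summary; the `RequestRecord` shape and the `summarize` function are illustrative assumptions, and the p95 latency uses the nearest-rank method. In practice you would send the records to your observability tool and let it do the aggregation.

```typescript
interface RequestRecord {
  latencyMs: number
  ok: boolean
  totalTokens: number
  costUsd: number
}

interface MetricsSummary {
  count: number
  errorRate: number
  p95LatencyMs: number
  totalTokens: number
  totalCostUsd: number
}

// Nearest-rank percentile on an ascending-sorted array (p between 0 and 1).
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0
  const rank = Math.ceil(p * sortedAsc.length)
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1))
  return sortedAsc[index]
}

function summarize(records: RequestRecord[]): MetricsSummary {
  const latencies = records.map((r) => r.latencyMs).sort((a, b) => a - b)
  const errors = records.filter((r) => !r.ok).length
  return {
    count: records.length,
    errorRate: records.length === 0 ? 0 : errors / records.length,
    p95LatencyMs: percentile(latencies, 0.95),
    totalTokens: records.reduce((sum, r) => sum + r.totalTokens, 0),
    totalCostUsd: records.reduce((sum, r) => sum + r.costUsd, 0),
  }
}
```

Grouping the records by user, feature, or day before calling `summarize` gives the per-dimension breakdowns listed above.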

Production rule: Deploy AI features behind feature flags. Start with 5% of users, measure costs and quality, and scale gradually to 100%. An uncontrolled rollout can blow up your API bill in hours.