Lesson 6 of 6 · 9 min read

Production Deployment

Getting an AI application running locally is the easy part. Running it reliably, cost-efficiently, and at scale in production requires specific strategies for Vercel and the AI SDK.

Vercel Deployment

Deployment Configuration

AI applications on Vercel often need a larger request body limit than the Server Actions default of 1 MB, for example to accept image uploads:

// next.config.ts
export default {
  experimental: {
    serverActions: {
      bodySizeLimit: '4mb', // For image uploads
    },
  },
}

Environment Variables

Variable          | Description             | Required
OPENAI_API_KEY    | OpenAI API key          | Yes (with OpenAI)
ANTHROPIC_API_KEY | Anthropic API key       | Yes (with Anthropic)
AI_PROVIDER       | Default provider        | Optional
AI_MAX_TOKENS     | Token limit per request | Recommended
AI_RATE_LIMIT     | Requests per minute     | Recommended
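
Reading these variables once at startup and failing fast on a missing key avoids discovering a misconfiguration on the first user request. The following is a minimal sketch, not part of the AI SDK: `loadAiConfig`, the `AiConfig` shape, and the default values (provider `openai`, 1024 tokens, 10 requests per minute) are assumptions made for illustration.

```typescript
type AiProvider = 'openai' | 'anthropic'

interface AiConfig {
  provider: AiProvider
  apiKey: string
  maxTokens: number
  rateLimitPerMinute: number
}

function isAiProvider(value: string): value is AiProvider {
  return value === 'openai' || value === 'anthropic'
}

// Parse an optional positive integer, falling back to a default when unset.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === '') return fallback
  const n = Number(raw)
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`Expected a positive integer, got "${raw}"`)
  }
  return n
}

// Hypothetical helper: validates the variables from the table above.
function loadAiConfig(env: Record<string, string | undefined>): AiConfig {
  const provider = env['AI_PROVIDER'] ?? 'openai' // assumed default
  if (!isAiProvider(provider)) {
    throw new Error(`Unsupported AI_PROVIDER "${provider}"`)
  }
  // Only the key for the selected provider is required.
  const keyName = provider === 'openai' ? 'OPENAI_API_KEY' : 'ANTHROPIC_API_KEY'
  const apiKey = env[keyName]
  if (!apiKey) throw new Error(`Missing required variable ${keyName}`)

  return {
    provider,
    apiKey,
    maxTokens: parsePositiveInt(env['AI_MAX_TOKENS'], 1024), // assumed default
    rateLimitPerMinute: parsePositiveInt(env['AI_RATE_LIMIT'], 10), // assumed default
  }
}
```

In a Node.js route you would typically call `loadAiConfig(process.env)` once at module scope and reuse the result.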

Vercel AI Gateway

Vercel offers an AI Gateway that acts as a proxy between your application and AI providers:

  • Caching: Answer identical requests from cache
  • Rate limiting: Enforce per-user limits
  • Fallback: Automatically switch to an alternative provider on failure
  • Analytics: Track token consumption, latency, and error rates
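
To make the fallback idea concrete, here is an application-level sketch of trying providers in order. It does not use or represent the AI Gateway's own API; `NamedProvider` and `callWithFallback` are hypothetical names, and each provider is modeled as a plain async function so the logic stays independent of any SDK.

```typescript
// A provider call is modeled as any async function from prompt to text.
type ProviderCall = (prompt: string) => Promise<string>

interface NamedProvider {
  name: string
  call: ProviderCall
}

// Try each provider in order and return the first successful answer.
// If every provider fails, throw one error that lists all failures.
async function callWithFallback(
  providers: NamedProvider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = []
  for (const p of providers) {
    try {
      const text = await p.call(prompt)
      return { provider: p.name, text }
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      errors.push(`${p.name}: ${message}`)
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`)
}
```

Returning the provider name alongside the text lets you log how often the fallback path is taken, which feeds directly into the analytics point above.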

Edge Runtime

When to Use Edge Runtime?

Vercel Edge Runtime executes code at globally distributed edge locations, close to the user, which keeps network latency low:

import { openai } from '@ai-sdk/openai'
import { streamText } from 'ai'

export const runtime = 'edge'

export async function POST(req: Request) {
  const { messages } = await req.json()

  const result = streamText({
    model: openai('gpt-4.1'),
    messages,
  })

  // AI SDK 4 API; AI SDK 5 replaces this with toUIMessageStreamResponse()
  return result.toDataStreamResponse()
}

Edge Runtime is suitable for:

  • Simple chat endpoints without database access
  • Streaming responses (lower time-to-first-byte)
  • Global applications with users in different regions

Prefer Node.js Runtime for:

  • Database access (Supabase, Prisma)
  • File system access
  • Native Node.js modules
  • Long-running operations (> 30 seconds)

Rate Limiting

Implementation with Upstash

import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 m'), // 10 requests/minute
})

export async function POST(req: Request) {
  // getUserId is your own auth helper (not shown), e.g. reading a session or API key
  const userId = getUserId(req)
  const { success } = await ratelimit.limit(userId)

  if (!success) {
    return new Response('Rate limit exceeded', { status: 429 })
  }

  // ... AI SDK logic
}

Multi-Level Limits

Level      | Limit             | Purpose
Free tier  | 20 requests/day   | Testing and onboarding
Pro tier   | 100 requests/hour | Normal usage
Enterprise | Custom            | By agreement
Per-model  | Variable          | Limit expensive models more
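
The tiers above can be sketched as a lookup table plus a limiter that charges expensive models more per call. This example uses a simple in-memory fixed-window counter so it runs without Redis; it is not the Upstash API and would not work across multiple serverless instances. The class `FixedWindowLimiter`, the helper names, and the model weights are assumptions for illustration.

```typescript
type Tier = 'free' | 'pro' | 'enterprise'

interface Limit {
  max: number
  windowMs: number
}

const HOUR_MS = 60 * 60 * 1000
const DAY_MS = 24 * HOUR_MS

// Limits from the table above; enterprise limits are supplied per customer.
const TIER_LIMITS = new Map<Tier, Limit>([
  ['free', { max: 20, windowMs: DAY_MS }],
  ['pro', { max: 100, windowMs: HOUR_MS }],
])

// Hypothetical weights: one call to an expensive model counts as several requests.
const MODEL_WEIGHTS = new Map<string, number>([
  ['gpt-4.1', 4],
  ['gpt-4.1-mini', 1],
])

function resolveLimit(tier: Tier, custom?: Limit): Limit {
  const limit = tier === 'enterprise' ? custom : TIER_LIMITS.get(tier)
  if (!limit) throw new Error(`No limit configured for tier "${tier}"`)
  return limit
}

function weightFor(modelId: string): number {
  return MODEL_WEIGHTS.get(modelId) ?? 1
}

class FixedWindowLimiter {
  private windows = new Map<string, { start: number; used: number }>()
  private now: () => number

  // The clock is injectable so the limiter can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now
  }

  // Returns true and records the usage if the request fits in the current window.
  tryConsume(key: string, limit: Limit, weight = 1): boolean {
    const t = this.now()
    let w = this.windows.get(key)
    if (!w || t - w.start >= limit.windowMs) {
      w = { start: t, used: 0 } // start a fresh window
      this.windows.set(key, w)
    }
    if (w.used + weight > limit.max) return false
    w.used += weight
    return true
  }
}
```

A route handler would call something like `limiter.tryConsume(userId, resolveLimit(tier), weightFor(modelId))` and return a 429 response when it yields `false`.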

Cost Tracking

Monitor Token Consumption

Every AI call consumes tokens, and every token is billed, so log usage for each request:

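import { openai } from '@ai-sdk/openai'
import { generateText } from 'ai'

const result = await generateText({
  model: openai('gpt-4.1'),
  prompt: '...',
})

// Log token consumption
// (AI SDK 4 field names; AI SDK 5 renames them to inputTokens / outputTokens)
console.log({
  inputTokens: result.usage.promptTokens,
  outputTokens: result.usage.completionTokens,
  totalTokens: result.usage.totalTokens,
  // calculateCost is an app-specific helper, not part of the AI SDK
  estimatedCost: calculateCost(result.usage),
})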

Cost per 1M Tokens (2026)

Model            | Input | Output
GPT-4.1          | $2.00 | $8.00
GPT-4.1 mini     | $0.40 | $1.60
Claude Sonnet 4  | $3.00 | $15.00
Gemini 2.5 Flash | $0.15 | $0.60
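
With per-million prices like these, the cost of a single call is (input tokens × input price + output tokens × output price) / 1,000,000. Below is one possible shape for a `calculateCost` helper. The price map keys are hypothetical labels, the default model is an assumption, and the figures are copied from the table above, so they need updating whenever provider pricing changes.

```typescript
interface TokenUsage {
  promptTokens: number
  completionTokens: number
}

interface ModelPrice {
  inputPerMillion: number // USD per 1M input tokens
  outputPerMillion: number // USD per 1M output tokens
}

// Prices copied from the table above; keys are illustrative labels.
const PRICES_USD = new Map<string, ModelPrice>([
  ['gpt-4.1', { inputPerMillion: 2.0, outputPerMillion: 8.0 }],
  ['gpt-4.1-mini', { inputPerMillion: 0.4, outputPerMillion: 1.6 }],
  ['claude-sonnet-4', { inputPerMillion: 3.0, outputPerMillion: 15.0 }],
  ['gemini-2.5-flash', { inputPerMillion: 0.15, outputPerMillion: 0.6 }],
])

// Returns the estimated cost of one call in USD.
function calculateCost(usage: TokenUsage, modelId = 'gpt-4.1'): number {
  const price = PRICES_USD.get(modelId)
  if (!price) throw new Error(`No price configured for model "${modelId}"`)
  const microDollars =
    usage.promptTokens * price.inputPerMillion +
    usage.completionTokens * price.outputPerMillion
  return microDollars / 1_000_000
}
```

For example, a GPT-4.1 call with 1,000 input tokens and 500 output tokens costs about $0.006: $0.002 for input plus $0.004 for output.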

Cost Optimization

  • Model routing: Simple questions → cheap model, complex → expensive
  • Prompt caching: Anthropic and OpenAI bill cached input tokens at a 50–90% discount
  • Output limits: Set maxTokens to avoid endless responses
  • Caching layer: Answer frequent questions from cache
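
Model routing and a caching layer can start out very simple. The sketch below routes on prompt length and a few keywords, and normalizes prompts into cache keys. The 150-word threshold, the keyword list, and the function names are assumptions chosen for illustration; a production router would more likely use a classifier or per-feature configuration.

```typescript
type RoutedModel = 'gpt-4.1-mini' | 'gpt-4.1'

// Hypothetical signals that a prompt needs the stronger model.
const COMPLEX_HINTS = /\b(analy[sz]e|compare|refactor|prove|step[- ]by[- ]step)\b/i
const LONG_PROMPT_WORDS = 150 // assumed threshold

// Pick the cheap model unless the prompt is long or looks complex.
function pickModel(prompt: string): RoutedModel {
  const wordCount = prompt.trim().split(/\s+/).filter(Boolean).length
  const looksComplex = COMPLEX_HINTS.test(prompt)
  return wordCount > LONG_PROMPT_WORDS || looksComplex ? 'gpt-4.1' : 'gpt-4.1-mini'
}

// Normalize a prompt so trivially different phrasings share one cache entry.
function cacheKey(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, ' ')
}

const answerCache = new Map<string, string>()

// Return a cached answer if present; otherwise compute, store, and return it.
async function cachedAnswer(
  prompt: string,
  generate: (prompt: string, model: RoutedModel) => Promise<string>,
): Promise<string> {
  const key = cacheKey(prompt)
  const hit = answerCache.get(key)
  if (hit !== undefined) return hit
  const answer = await generate(prompt, pickModel(prompt))
  answerCache.set(key, answer)
  return answer
}
```

The `generate` callback is where a call such as `generateText` would go; keeping it as a parameter makes the routing and caching logic testable without network access.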

A/B Testing & Monitoring

A/B Testing AI Features

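import { anthropic } from '@ai-sdk/anthropic'
import { openai } from '@ai-sdk/openai'
import { streamText } from 'ai'

// abTest is your own experiment helper (not part of the AI SDK)
const model = abTest('ai-model-test', userId) === 'variant_a'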
  ? openai('gpt-4.1-mini')
  : anthropic('claude-sonnet-4-20250514')

const result = streamText({ model, messages })

What to A/B test:

  • Different models (GPT vs. Claude vs. Gemini)
  • System prompts (short vs. detailed)
  • Temperature settings
  • RAG strategies (top-3 vs. top-5 chunks)
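
An A/B test only produces usable data if each user stays in the same variant across requests. One common way to get that without storing assignments is to hash the experiment name together with the user ID. The sketch below is one possible `abTest` helper under that assumption; it uses an FNV-1a-style hash over UTF-16 code units, and the 50/50 split is an arbitrary choice.

```typescript
type Variant = 'variant_a' | 'variant_b'

// FNV-1a-style 32-bit hash, mapped to a number in [0, 1).
function hashToUnit(input: string): number {
  let hash = 0x811c9dc5 // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) / 0x100000000
}

// Deterministic assignment: the same experiment and user always map to the same variant.
function abTest(experiment: string, userId: string): Variant {
  return hashToUnit(`${experiment}:${userId}`) < 0.5 ? 'variant_a' : 'variant_b'
}
```

Including the experiment name in the hash input means a user who lands in `variant_a` for one test is not automatically in `variant_a` for every other test.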

Monitoring Dashboard

Track these metrics in production:

  • Latency: Time-to-first-token, total response time
  • Error rate: Provider errors, rate limits, timeouts
  • Token consumption: Per user, per feature, per day
  • Cost: Daily, weekly, and monthly AI costs
  • Quality: User feedback (thumbs up/down), retry rate
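
Several of these metrics can be derived from one log record per AI call. The following sketch aggregates such records into error rate, latency percentiles, and tokens per user. The `CallRecord` shape and the `summarize` function are hypothetical, and percentiles use the simple nearest-rank method.

```typescript
interface CallRecord {
  userId: string
  latencyMs: number
  totalTokens: number
  ok: boolean // false for provider errors, rate limits, timeouts
}

interface Summary {
  count: number
  errorRate: number
  p50LatencyMs: number
  p95LatencyMs: number
  tokensByUser: Map<string, number>
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0
  const rank = Math.ceil((p / 100) * sorted.length) - 1
  const index = Math.min(sorted.length - 1, Math.max(0, rank))
  return sorted[index]
}

function summarize(records: CallRecord[]): Summary {
  const latencies = records.map((r) => r.latencyMs).sort((a, b) => a - b)
  const failures = records.filter((r) => !r.ok).length

  // Sum token consumption per user.
  const tokensByUser = new Map<string, number>()
  for (const r of records) {
    tokensByUser.set(r.userId, (tokensByUser.get(r.userId) ?? 0) + r.totalTokens)
  }

  return {
    count: records.length,
    errorRate: records.length === 0 ? 0 : failures / records.length,
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    tokensByUser,
  }
}
```

Tracking p95 alongside the median matters for AI endpoints because a small share of very slow generations can dominate the user experience while leaving the average almost unchanged.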

Production rule: Deploy AI features behind feature flags. Start with 5% of users, measure costs and quality, and scale gradually to 100%. An uncontrolled rollout can blow up your API bill in hours.
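
A percentage rollout like this can be sketched with a stable per-user bucket: a user is enabled when their bucket falls below the current percentage, so raising the percentage only ever adds users. The helper names below are hypothetical, and a hosted feature-flag service would normally replace this code.

```typescript
// Map a flag/user pair to a stable bucket in [0, 100) using an FNV-1a-style hash.
function rolloutBucket(flag: string, userId: string): number {
  const input = `${flag}:${userId}`
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return ((hash >>> 0) / 0x100000000) * 100
}

// A user is enabled when their bucket is below the rollout percentage.
// Because the bucket never changes, users enabled at 5% stay enabled at 25%.
function isEnabled(flag: string, userId: string, percent: number): boolean {
  return rolloutBucket(flag, userId) < percent
}
```

A route could then check `isEnabled('ai-chat', userId, 5)` before calling the model and fall back to the non-AI path otherwise; `'ai-chat'` is an example flag name.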

Quiz

Question 1 of 3

When should you use Edge Runtime instead of Node.js Runtime for AI endpoints?