Lesson 5 of 5 · 10 min read

Cost Optimization

AI costs can quickly spiral out of control. The good news: with the right strategies, you can often reduce your inference costs by 50–80% with little or no loss in quality.

The 5 Levers of Cost Optimization

1. Intelligent Model Routing

Not every request needs the most powerful model.

Routing strategy:

  • Simple questions (FAQ, summarization): GPT-4o-mini or Llama 3 8B → ~95% cheaper
  • Standard tasks (text generation, analysis): GPT-4o or Claude Sonnet → baseline
  • Complex tasks (code review, reasoning): Claude Opus or a comparable top-tier model → premium cost

Automatic classification: A small classifier model (< 1B parameters) decides in < 10 ms which model handles the request. Savings: 40–60% of total costs.
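The routing idea can be sketched in a few lines. This is a minimal illustration, not a production router: the keyword rules stand in for the small classifier model, and the tier names, model names, and relative costs are hypothetical placeholders rather than real price data.

```python
# Hypothetical tiers; relative_cost is illustrative (standard tier = 1.0).
TIERS = {
    "simple": {"model": "small-model", "relative_cost": 0.05},
    "standard": {"model": "standard-model", "relative_cost": 1.0},
    "complex": {"model": "premium-model", "relative_cost": 5.0},
}

# Toy keyword lists standing in for a trained classifier (assumption).
COMPLEX_HINTS = ("review this code", "prove that", "step by step", "debug")
SIMPLE_HINTS = ("summarize", "faq", "what is", "opening hours")


def classify(request: str) -> str:
    """Return the tier name for a request; complex hints take priority."""
    text = request.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in text for hint in SIMPLE_HINTS):
        return "simple"
    return "standard"


def route(request: str) -> str:
    """Pick the model name for a request based on its tier."""
    return TIERS[classify(request)]["model"]


def blended_cost(requests: list[str]) -> float:
    """Average relative cost per request; 1.0 means 'everything on standard'."""
    costs = [TIERS[classify(r)]["relative_cost"] for r in requests]
    return sum(costs) / len(costs)
```

Running `blended_cost` over a sample of real traffic gives a quick estimate of what routing would save before any classifier is trained.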

2. Caching Strategies

The cheapest token is the one you don't generate.

  • Exact match cache: Identical requests → Redis lookup (1 ms instead of 2 s)
  • Semantic cache: Similar requests → vector similarity search
  • Prompt caching: Anthropic and OpenAI discount repeated prompt prefixes (up to 90% off cached input tokens)
  • Response caching: Stable responses (e.g., product descriptions) cached with TTL

Typical cache hit rate: 20–40% → direct cost reduction.
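A minimal sketch of the exact match cache with TTL follows. It uses an in-memory dictionary as a stand-in for Redis, and `call_model` is a hypothetical function supplied by the caller that performs the real inference request. Normalizing whitespace before hashing is a design choice of this sketch, not something every cache should do.

```python
import hashlib
import time


class ExactMatchCache:
    """In-memory stand-in for a Redis exact-match cache with TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self.store = {}             # key -> (expires_at, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Collapse whitespace so trivially different requests share a key.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_model):
        """Return a cached response, or call the model and cache the result."""
        k = self.key(model, prompt)
        entry = self.store.get(k)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = call_model(model, prompt)
        self.store[k] = (self.clock() + self.ttl, response)
        return response

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A semantic cache would follow the same shape, but replace the hash lookup with a nearest-neighbor search over prompt embeddings and a similarity threshold.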

3. Batching

Bundle requests instead of sending individually.

  • Synchronous batching: Collect requests for 100 ms, then send as batch
  • Asynchronous batching: Non-time-critical tasks (reports, analyses) processed overnight
  • Batch APIs: OpenAI offers a 50% discount for asynchronous batch requests (results within a 24-hour completion window)
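The synchronous batching bullet can be sketched as a small collector that flushes when the time window has elapsed or the batch is full. This is an illustrative sketch: `send_batch` is a hypothetical callback that forwards a list of requests to the backend, `max_size` is an added assumption, and a real service would call `flush_if_due` from a timer or event loop.

```python
import time


class MicroBatcher:
    """Collects requests and sends them as one batch per time window."""

    def __init__(self, send_batch, window_seconds=0.1, max_size=32,
                 clock=time.monotonic):
        self.send_batch = send_batch    # hypothetical callback: list -> None
        self.window = window_seconds
        self.max_size = max_size
        self.clock = clock              # injectable for deterministic tests
        self.pending = []
        self.window_start = None

    def submit(self, request):
        # The window starts when the first request of a batch arrives.
        if not self.pending:
            self.window_start = self.clock()
        self.pending.append(request)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush if the window has elapsed or the batch is full."""
        if not self.pending:
            return
        window_elapsed = self.clock() - self.window_start >= self.window
        if window_elapsed or len(self.pending) >= self.max_size:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.send_batch(batch)
```

The trade-off is latency: every request may wait up to one window before it is sent, so the window should stay small for interactive traffic.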

4. Use Smaller Models Strategically

Larger models aren't always better.

Benchmark results 2026:

  • GPT-4o-mini achieves 92% of GPT-4o quality at 1/20th the cost
  • Llama 3.2 3B for classification: 97% accuracy at 1/100th the cost of a 70B model
  • Specialized fine-tuned models beat general-purpose models in their domain

Rule: Always test the smallest model first. Scale up only if quality is insufficient.
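One way to apply this rule is to run the same evaluation set against each candidate, cheapest first, and stop at the first model that clears a quality threshold. The sketch below assumes a hypothetical list of `(name, predict_fn)` pairs and exact-match scoring; real evaluations usually need a task-specific metric.

```python
def pick_smallest_sufficient(models, eval_set, min_accuracy):
    """Return (name, accuracy) of the first model that meets min_accuracy.

    models:   list of (name, predict_fn), ordered cheapest to most expensive
    eval_set: non-empty list of (prompt, expected_answer) pairs
    Returns (None, 0.0) if no model reaches the threshold.
    """
    for name, predict in models:
        correct = sum(1 for prompt, expected in eval_set
                      if predict(prompt) == expected)
        accuracy = correct / len(eval_set)
        if accuracy >= min_accuracy:
            return name, accuracy
    return None, 0.0
```

Because the loop stops at the first sufficient model, larger models are only evaluated (and paid for) when the smaller ones fall short.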

5. Token Optimization

Fewer tokens = lower costs.

  • Shorten prompts: Reduce system prompts to essentials (often 50% shorter possible)
  • Limit output: Set max_tokens to prevent endless responses
  • Structured output: JSON with short field names instead of prose; easier to parse and usually shorter
  • Context window: Only relevant documents in context, not all
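The context window bullet can be sketched as a greedy selection under a token budget. This is an illustrative sketch under two assumptions: relevance scores come from some upstream retrieval step, and token counts are approximated at roughly four characters per token, which is only a rough heuristic for English text. Use the provider's tokenizer for real accounting.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def select_context(scored_docs, budget_tokens):
    """Pick the most relevant documents that fit into the token budget.

    scored_docs: list of (relevance_score, text) pairs
    Returns (chosen_texts, tokens_used).
    """
    chosen, used = [], 0
    ranked = sorted(scored_docs, key=lambda d: d[0], reverse=True)
    for _score, text in ranked:
        cost = approx_tokens(text)
        # Skip documents that would exceed the budget; keep trying smaller ones.
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen, used
```

The same budget thinking applies to output: setting `max_tokens` on the request caps what a single response can cost.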

Cost Dashboard

Track daily:

  1. Cost per use case (not just total)
  2. Cost per user (identify power users)
  3. Cache hit rate (target: > 30%)
  4. Model distribution (what % runs on cheaper model?)

Target 2026: Under €0.01 per user interaction. With the right optimizations, this is achievable for most use cases.
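The four dashboard metrics can be computed from a plain request log. The sketch below assumes a hypothetical record format (dictionaries with the keys `use_case`, `user`, `model`, `cost_eur`, and `cache_hit`); adapt the field names to whatever your logging actually produces.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate request logs into the daily dashboard metrics.

    records: list of dicts with keys use_case, user, model, cost_eur, cache_hit
    """
    by_use_case = defaultdict(float)
    by_user = defaultdict(float)
    by_model = defaultdict(int)
    hits = 0
    for r in records:
        by_use_case[r["use_case"]] += r["cost_eur"]
        by_user[r["user"]] += r["cost_eur"]
        by_model[r["model"]] += 1
        hits += 1 if r["cache_hit"] else 0
    n = len(records)
    total = sum(by_use_case.values())
    return {
        "cost_per_use_case": dict(by_use_case),
        "cost_per_user": dict(by_user),
        "cache_hit_rate": hits / n if n else 0.0,
        "model_share": {m: c / n for m, c in by_model.items()},
        "cost_per_interaction": total / n if n else 0.0,
    }
```

Comparing `cost_per_interaction` against the €0.01 target and `cache_hit_rate` against the 30% target gives a simple daily pass/fail check.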

Quiz

Question 1 of 3

Which cost optimization strategy typically yields the greatest savings?