Lesson 2 of 5 · 12 min read

GPU Selection and Inference Costs

GPUs are the heart of any AI infrastructure. The right choice determines the performance, cost, and scalability of your AI applications.

The GPU Landscape 2026

NVIDIA H100 — The Current Standard

  • 80 GB HBM3 memory, up to 3,958 TFLOPS (FP8, with sparsity)
  • Price: ~€30,000–40,000 per GPU (outright purchase)
  • Cloud cost: ~€2.50–4.00/hour (on-demand)
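  • Ideal for: Inference of medium to large models (up to 70B parameters; at FP16 a 70B model needs ~140 GB for weights alone, so a single 80 GB card requires quantization or a multi-GPU setup)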

NVIDIA H200 — More Memory, More Speed

  • 141 GB HBM3e memory, about 1.76x the H100's 80 GB
  • 30–40% faster inference than the H100, thanks to higher memory bandwidth
  • Price: ~€35,000–50,000 per GPU
  • Ideal for: Large models (70B+), long contexts, multi-modal

NVIDIA B200 (Blackwell) — Next Generation

  • 192 GB HBM3e, FP4 support for efficient inference
  • Up to 2.5x faster than H100 for inference
  • Availability: Increasingly available from Q2 2026
  • Ideal for: New investments aiming for future-proofing

Alternatives

  • AMD MI300X: 192 GB HBM3, competitive price/performance
  • Google TPU v5p: Optimal for JAX/TensorFlow workloads on GCP
  • AWS Inferentia2: Often the lowest-cost option on AWS for pure inference workloads, provided the model is supported
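Memory capacity is usually the first filter when choosing among these cards. The sketch below estimates weight memory as parameters times bytes per parameter and checks which of the GPUs listed above could hold a model on a single card. The memory sizes come from the lists above; the function names and the 20% overhead factor for KV cache and activations are assumptions for illustration, not measured values.

```python
# Memory per GPU in GB, taken from the lists above.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192, "MI300X": 192}

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def gpus_that_fit(params_billion: float, precision: str, overhead: float = 1.2) -> list[str]:
    """Return GPUs whose memory covers weights plus an ASSUMED overhead factor."""
    needed = weights_gb(params_billion, precision) * overhead
    return [name for name, mem in GPU_MEMORY_GB.items() if mem >= needed]


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(precision, weights_gb(70, precision), "GB ->", gpus_that_fit(70, precision))
```

Under these assumptions a 70B model at FP16 only fits on the 192 GB cards, while INT4 brings it within reach of a single H100. Real deployments with long contexts or large batches may need considerably more headroom than 20%.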

Calculating Inference Costs

API-based (Managed)

Simplest approach — you pay per token:

Model                   Input (per 1M tokens)   Output (per 1M tokens)
GPT-4o                  ~€2.50                  ~€10.00
Claude 3.5 Sonnet       ~€3.00                  ~€15.00
Llama 3 70B (hosted)    ~€0.60                  ~€0.80
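To turn per-token prices into a cost per request, multiply the input and output token counts by their respective rates. The sketch below uses the approximate prices from the table; the function name and the example request size (1,000 input tokens, 500 output tokens) are assumptions chosen for illustration.

```python
# Approximate prices in EUR per 1M tokens (input, output), from the table above.
PRICES_EUR_PER_1M = {
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Llama 3 70B (hosted)": (0.60, 0.80),
}


def api_cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in EUR of one request, given its input and output token counts."""
    in_price, out_price = PRICES_EUR_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 1,000 input tokens, 500 output tokens.
    for model in PRICES_EUR_PER_1M:
        print(f"{model}: ~EUR {api_cost_per_request(model, 1000, 500):.4f}")
```

For this example request size the cost ranges from roughly €0.001 (hosted Llama 3 70B) to roughly €0.0105 (Claude 3.5 Sonnet). Output tokens dominate the bill for the proprietary models.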

Self-Hosted

Own GPU infrastructure — higher upfront costs, but cheaper at volume:

Cost calculation per request:

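  1. GPU hour: ~€3.00 (H100 cloud) or ~€0.80 (owned, amortized over 3 years)
  2. Throughput: ~50 requests/second (Llama 70B, optimized), i.e. ~180,000 requests/hour at full utilization
  3. Cost per request: €3.00 / 180,000 ≈ €0.000017 (self-hosted on a cloud H100) vs. ~€0.002 (API, depending on model and tokens per request)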
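The three steps above can be sketched as a small function: divide the hourly GPU cost by the number of requests served in that hour. The `utilization` parameter is an added assumption, not part of the original calculation; it is there because a GPU that sits idle part of the time still costs the full hourly rate.

```python
def cost_per_request(gpu_hour_eur: float, requests_per_second: float,
                     utilization: float = 1.0) -> float:
    """Self-hosted cost per request in EUR.

    utilization is the ASSUMED fraction of the hour the GPU is serving
    traffic at the stated throughput (1.0 = fully loaded).
    """
    requests_per_hour = requests_per_second * 3600 * utilization
    return gpu_hour_eur / requests_per_hour


if __name__ == "__main__":
    print(f"Cloud H100, full load: EUR {cost_per_request(3.00, 50):.7f}")
    print(f"Owned GPU, full load:  EUR {cost_per_request(0.80, 50):.7f}")
    print(f"Cloud H100, 30% load:  EUR {cost_per_request(3.00, 50, 0.3):.7f}")
```

At full load this gives about €0.0000167 per request on a cloud H100 and about €0.0000044 on owned hardware. At 30% utilization the cloud figure rises to about €0.0000556, so the gap to API pricing narrows as traffic becomes less steady.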

Optimization Strategies

  • Quantization: Moving from FP16 to INT8 cuts weight memory by 50%, and INT4 by 75%; latency typically drops by 30–50%
  • Batching: Process multiple requests simultaneously; this can roughly triple throughput, depending on the workload
  • Model distillation: Train smaller models that imitate the large model
  • vLLM & TensorRT-LLM: Optimized inference engines; vLLM introduced PagedAttention, and TensorRT-LLM offers a comparable paged KV cache
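To make the quantization item concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization in plain Python. It is one simple scheme chosen for illustration, not the method any particular engine uses; production libraries typically quantize per channel or per group and use calibration data. The function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    # Fall back to 1.0 when all weights are zero to avoid dividing by zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.01, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    max_error = max(abs(a - b) for a, b in zip(weights, restored))
    print("INT8 values:", q)
    print("max round-trip error:", max_error)
```

Each weight now occupies one byte instead of the two bytes FP16 needs, which is where the 50% memory saving comes from. The price is a rounding error of at most half the scale factor per weight.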

Decision guide: Under 10,000 requests/day → API. Over 100,000 → evaluate self-hosted. In between → depends on the use case.
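The thresholds in the decision guide can be sanity-checked with a simple break-even estimate: the daily request volume at which a GPU running around the clock costs the same as paying per request. The sketch below uses the lesson's figures (~€3.00 per H100 cloud hour, ~€0.002 per API request). The function names are hypothetical, and the model deliberately ignores engineering, operations, and redundancy costs, which tend to push the real break-even point higher.

```python
def break_even_requests_per_day(gpu_hour_eur: float,
                                api_cost_per_request_eur: float,
                                gpus: int = 1) -> float:
    """Daily volume at which always-on GPUs cost the same as the API."""
    daily_gpu_cost = gpu_hour_eur * 24 * gpus
    return daily_gpu_cost / api_cost_per_request_eur


def recommend(requests_per_day: int) -> str:
    """Encode the decision guide's thresholds."""
    if requests_per_day < 10_000:
        return "API"
    if requests_per_day > 100_000:
        return "evaluate self-hosted"
    return "depends on the use case"


if __name__ == "__main__":
    print(break_even_requests_per_day(3.00, 0.002))  # one cloud H100
    print(recommend(5_000), "|", recommend(50_000), "|", recommend(500_000))
```

With these inputs a single cloud H100 breaks even at about 36,000 requests per day, which falls inside the "in between" band of the guide.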