GPU Selection and Inference Costs
GPUs are the heart of every AI infrastructure. The right choice determines performance, cost, and scalability of your AI applications.
The GPU Landscape 2026
NVIDIA H100 — The Current Standard
- 80 GB HBM3 memory, up to 3,958 TFLOPS (FP8)
- Price: ~€30,000–40,000 per GPU (single purchase)
- Cloud cost: ~€2.50–4.00/hour (on-demand)
- Ideal for: Inference of medium to large models (up to 70B parameters)
NVIDIA H200 — More Memory, More Speed
- 141 GB HBM3e memory — nearly double the H100
- 30–40% faster inference through higher bandwidth
- Price: ~€35,000–50,000 per GPU
- Ideal for: Large models (70B+), long contexts, multi-modal
NVIDIA B200 (Blackwell) — Next Generation
- 192 GB HBM3e, FP4 support for efficient inference
- Up to 2.5x faster than H100 for inference
- Availability: Increasingly available from Q2 2026
- Ideal for: New investments aiming for future-proofing
Alternatives
- AMD MI300X: 192 GB HBM3, competitive price/performance
- Google TPU v5p: Optimal for JAX/TensorFlow workloads on GCP
- AWS Inferentia2: Cheapest option for pure inference workloads
Calculating Inference Costs
API-based (Managed)
Simplest approach — you pay per token:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|
| GPT-4o | ~€2.50 | ~€10.00 |
| Claude 3.5 Sonnet | ~€3.00 | ~€15.00 |
| Llama 3 70B (hosted) | ~€0.60 | ~€0.80 |
Self-Hosted
Own GPU infrastructure — higher upfront costs, but cheaper at volume:
Cost calculation per request:
- GPU hour: ~€3.00 (H100 cloud) or ~€0.80 (owned, amortized over 3 years)
- Throughput: ~50 requests/second (Llama 70B, optimized)
- Cost per request: ~€0.000016 (self-hosted) vs. ~€0.002 (API)
Optimization Strategies
- Quantization: FP16 → INT8 → INT4 reduces memory by 50–75%, latency by 30–50%
- Batching: Process multiple requests simultaneously — triple throughput
- Model distillation: Train smaller models that imitate the large model
- vLLM & TensorRT-LLM: Optimized inference engines with PagedAttention
Decision guide: Under 10,000 requests/day → API. Over 100,000 → evaluate self-hosted. In between → depends on the use case.