Lesson 2 of 5 · 12 min read

GPU Selection and Inference Costs

GPUs are at the heart of most AI infrastructure. The right choice determines the performance, cost, and scalability of your AI applications.

The GPU Landscape 2026

NVIDIA H100 — The Current Standard

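80 GB HBM3 memory, up to 3,958 TFLOPS (FP8, with sparsity; roughly half that for dense workloads)
Price: ~€30,000–40,000 per GPU (one-time purchase)
Cloud cost: ~€2.50–4.00/hour (on-demand)
Ideal for: Inference of medium to large models (up to 70B parameters; a 70B model needs about 140 GB at FP16, so it requires quantization or more than one GPU)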

NVIDIA H200 — More Memory, More Speed

141 GB HBM3e memory, about 1.76x the H100's 80 GB
30–40% faster inference than the H100, driven by higher memory bandwidth
Price: ~€35,000–50,000 per GPU
Ideal for: Large models (70B+), long contexts, multi-modal

NVIDIA B200 (Blackwell) — Next Generation

192 GB HBM3e, FP4 support for efficient inference
Up to 2.5x faster than H100 for inference
Availability: Broadening from Q2 2026
Ideal for: New investments where future-proofing matters

Alternatives

AMD MI300X: 192 GB HBM3, competitive price/performance
Google TPU v5p: Well suited to JAX/TensorFlow workloads on GCP
AWS Inferentia2: Often among the lowest-cost options for pure inference workloads on AWS, provided the model is supported by the Neuron SDK
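
The memory figures above largely decide which models fit on which GPU. As a rough sketch: weights at FP16 take about 2 bytes per parameter, so a model with N billion parameters needs roughly 2N GB before any KV cache or activations. The function names and the 20% headroom factor below are illustrative assumptions, not vendor guidance.

```python
import math

# Memory per GPU in GB, taken from the figures listed above.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192, "MI300X": 192}


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (1B parameters at 1 byte each is about 1 GB)."""
    return params_billion * bytes_per_param


def gpus_needed(params_billion: float, gpu: str, headroom: float = 1.2) -> int:
    """GPUs required to hold FP16 weights plus an assumed 20% headroom
    for KV cache and activations (the headroom value is a guess)."""
    required_gb = weight_memory_gb(params_billion) * headroom
    return math.ceil(required_gb / GPU_MEMORY_GB[gpu])


# How many GPUs of each type would a 70B model need at FP16?
for name in GPU_MEMORY_GB:
    print(f"{name}: {gpus_needed(70, name)} GPU(s) for a 70B model at FP16")
```

Real deployments need more headroom for long contexts and large batches, so treat the result as a lower bound.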

Calculating Inference Costs

API-based (Managed)

Simplest approach — you pay per token:

Model	Input (per 1M tokens)	Output (per 1M tokens)
GPT-4o	~€2.50	~€10.00
Claude 3.5 Sonnet	~€3.00	~€15.00
Llama 3 70B (hosted)	~€0.60	~€0.80
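
To turn the per-token prices in the table into a per-request cost, multiply each token count by its price and divide by one million. The sketch below uses the table's approximate prices; the request size (400 input tokens, 100 output tokens) and the function name are hypothetical.

```python
# Approximate prices in EUR per 1M tokens (input, output), copied from the table above.
PRICES_EUR_PER_1M = {
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Llama 3 70B (hosted)": (0.60, 0.80),
}


def api_cost_eur(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in EUR for the given token counts."""
    price_in, price_out = PRICES_EUR_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical request: 400 input tokens, 100 output tokens.
for model in PRICES_EUR_PER_1M:
    print(f"{model}: EUR {api_cost_eur(model, 400, 100):.6f} per request")
```

Output tokens cost several times more than input tokens for the proprietary models, so response length tends to dominate the bill.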

Self-Hosted

Running your own GPU infrastructure means higher upfront costs but a lower cost per request at high volume:

Cost calculation per request:

GPU hour: ~€3.00 (H100 cloud) or ~€0.80 (owned, amortized over 3 years)
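Throughput: ~50 requests/second (Llama 70B, optimized; an optimistic figure that depends heavily on batch size, output length, and hardware)
Cost per request: ~€0.000017 at the cloud rate (€3.00 ÷ 180,000 requests/hour) or ~€0.000004 on owned hardware, both assuming full utilization, vs. ~€0.002 (API)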
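
The per-request figure follows from dividing the hourly GPU cost by the number of requests served in that hour. A minimal sketch of that arithmetic is below; the `utilization` parameter is an added assumption to show how idle time raises the effective cost, and the 30% value is purely illustrative.

```python
def cost_per_request_eur(
    gpu_hour_eur: float,
    requests_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Hourly GPU cost divided by the requests actually served in one hour.

    utilization is the fraction of the hour the GPU is busy (1.0 = fully loaded).
    """
    requests_per_hour = requests_per_second * 3600 * utilization
    return gpu_hour_eur / requests_per_hour


# Figures from the lesson: EUR 3.00/h (cloud H100), EUR 0.80/h (owned), 50 requests/s.
print(f"Cloud, fully utilized: EUR {cost_per_request_eur(3.00, 50):.7f}")
print(f"Owned, fully utilized: EUR {cost_per_request_eur(0.80, 50):.7f}")
# Illustrative assumption: the GPU is busy only 30% of the time.
print(f"Cloud, 30% utilized:   EUR {cost_per_request_eur(3.00, 50, 0.3):.7f}")
```

Because the GPU is billed whether or not it is serving traffic, low utilization is usually what erodes the self-hosted cost advantage.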

Optimization Strategies

Quantization: Moving from FP16 to INT8 halves weight memory and INT4 cuts it by 75%; latency typically drops by 30–50%
Batching: Process multiple requests simultaneously, which can roughly triple throughput depending on the workload
Model distillation: Train smaller models that imitate the large model
vLLM & TensorRT-LLM: Optimized inference engines; vLLM introduced PagedAttention, and TensorRT-LLM offers a comparable paged KV cache
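
The memory savings from quantization come directly from the bytes stored per parameter. The sketch below covers weight memory only (KV cache and activations are ignored), and the helper names are illustrative.

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[precision]


def reduction_vs_fp16(precision: str) -> float:
    """Fraction of weight memory saved relative to FP16."""
    return 1 - BYTES_PER_PARAM[precision] / BYTES_PER_PARAM["FP16"]


# Weight memory for a 70B model at each precision.
for precision in BYTES_PER_PARAM:
    print(
        f"70B at {precision}: {weight_gb(70, precision):.0f} GB "
        f"({reduction_vs_fp16(precision):.0%} less than FP16)"
    )
```

Quantization can cost some accuracy, so it is worth evaluating the quantized model on your own tasks before committing to it.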

Decision guide: Under 10,000 requests/day → API. Over 100,000 → evaluate self-hosted. In between → depends on the use case.
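
One way to sanity-check these thresholds is a simple break-even estimate: a GPU that runs around the clock costs the same regardless of traffic, so self-hosting pays off once daily API spend exceeds the daily GPU cost. The sketch below uses the lesson's approximate figures and deliberately ignores engineering time, redundancy, and networking, all of which push the real break-even higher. The function name is hypothetical.

```python
def break_even_requests_per_day(
    gpu_hour_eur: float,
    api_cost_per_request_eur: float,
    gpus: int = 1,
) -> float:
    """Daily request volume at which API spend equals the cost of running
    the given number of GPUs 24 hours a day."""
    daily_gpu_cost = gpu_hour_eur * 24 * gpus
    return daily_gpu_cost / api_cost_per_request_eur


# Lesson figures: EUR 3.00/h (cloud H100), EUR 0.80/h (owned), ~EUR 0.002 per API request.
cloud = break_even_requests_per_day(3.00, 0.002)
owned = break_even_requests_per_day(0.80, 0.002)
print(f"Cloud H100 break-even: about {round(cloud):,} requests/day")
print(f"Owned H100 break-even: about {round(owned):,} requests/day")
```

Under these assumptions both break-even points fall between the low thousands and the tens of thousands of requests per day, which is consistent with treating the middle range as a case-by-case decision.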
