The Real Cost of AI Inference in 2026: A Practical Breakdown
If you take one thing from this piece, it should be this: the headline GPU hourly rate is almost never the right number to optimize. What matters is cost per million tokens, and that number is shaped by three multipliers - the GPU itself, the serving stack (vLLM, TensorRT-LLM, SGLang), and the precision and batching strategy you run. Get all three right and the same workload that costs USD 5 per million tokens can cost USD 0.20. This is Part 2 of the AI Inference Hardware series, walking through real cloud GPU rates, the cost-per-million-tokens math, the serving stacks that actually move the needle, and the self-host vs API breakeven point in 2026.
Introduction
In Part 1, we walked through the inference market and the silicon fight between NVIDIA and AMD. This piece is about the boring, important number that determines whether any of that hardware actually makes sense for your workload:
What does it cost to serve a million tokens?
The headline GPU hourly rate is almost never the right number to optimize. What matters is cost per million tokens (CPM), and that number is shaped by three multipliers:
- The GPU itself
- The serving stack (vLLM, TensorRT-LLM, SGLang)
- The precision and batching strategy
Get all three right and the same workload that costs USD 5 per million tokens can cost USD 0.20.
Cloud GPU Pricing Today
Pricing has fallen sharply since 2024, mostly because of two pressures. First, AWS cut H100 pricing 44% in June 2025, pulling the entire market down. Second, dozens of specialized "neo-cloud" providers - RunPod, Lambda Labs, Spheron, GMI Cloud, Vultr, Crusoe, Together AI - entered the market, competing on price for the same NVIDIA inventory.
Going rates as of mid-2026:
1. NVIDIA H100 SXM
- Neo-cloud on-demand: USD 1.99 - 3.50 per GPU-hour.
- GMI Cloud: USD 2.00 - 2.10. Atlas Cloud: USD 2.95. RunPod community: USD 1.99 - 3.29.
- Hyperscale clouds (AWS, GCP, Azure): USD 4.00 - 8.00 for the same GPU.
- Spot pricing on Spheron: as low as USD 1.25/hr for interruption-tolerant workloads.
2. NVIDIA H200
- On-demand: USD 2.30 - 4.54.
- GMI lists USD 2.60, Nebius around USD 2.30 on commit.
- The premium over H100 is justified primarily by 141 GB of HBM3e memory, which translates directly into fitting larger models or higher concurrent batch sizes on a single chip.
3. NVIDIA B200
- On-demand (Q1 2026): USD 4.99 (Lambda) to USD 6.02 (Spheron) per hour.
- Spot pricing: around USD 2.12/hr.
- Despite higher hourly rate, frequently wins on cost-per-million-tokens because the throughput premium outpaces the price premium for long-context inference.
4. AMD MI300X / MI325X
- Harder to pin down because fewer providers carry it.
- Vultr: MI300X at USD 1.85/hr. Crusoe: USD 3.45.
- MI325X short-term rentals: USD 2.75 - 3.00/hr.
- For hyperscalers and enterprises that own their fleet, MI300X can beat H100 on dollars-per-token.
- For short-term renters, it usually does not - thin rental market inflates per-hour rates.
5. A100 80GB (Legacy but Relevant)
- Spot pricing on Vast.ai: as low as USD 0.29/hr.
- On-demand: USD 1.20 - 1.50/hr.
- Catch: A100 lacks FP8 support, so its lower hourly rate often loses on cost-per-token to an H100 running FP8.
Translating GPU Rate Into Cost Per Million Tokens
The conversion formula is straightforward but the variables matter:
Cost per million tokens (CPM) = (cluster $/hour) / (tokens_per_sec × 3600 / 1,000,000)
Concrete examples:
- H100 SXM at USD 2.90/hr running Llama 4 Scout 17B via vLLM at 4,200 tok/s → CPM ≈ USD 0.19
- H100 running 70B at FP8 with optimal batching (1,500-2,500 tok/s) at USD 2.90/hr → CPM ≈ USD 0.32 - 0.54
- B200 on-demand at USD 5.89/hr with throughput premium → CPM ≈ USD 0.42 per million (beats H100 PCIe at USD 0.47 despite higher hourly rate)
- B200 spot at USD 2.12/hr → CPM drops to roughly USD 0.15 per million (cost leader for checkpoint-tolerant workloads)
- H200 at USD 2.60/hr with 1.83-2.14x throughput advantage over H100 on long-context Llama → USD 0.70 per million for VRAM-headroom-hungry workloads
These are wholesale GPU-rental numbers. They do not yet include the engineering effort to keep utilization high, which is the single biggest determinant of whether self-hosting beats API pricing.
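The conversion is simple enough to keep in a small helper. The sketch below applies the formula above to the first two H100 examples and inverts it to show the throughput implied by the B200 on-demand figure. The function names are illustrative, and the throughput inputs are the article's quoted numbers rather than measurements.

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """USD per one million tokens at a given hourly rate and sustained throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    millions_of_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return usd_per_hour / millions_of_tokens_per_hour


def implied_tokens_per_sec(usd_per_hour: float, cpm_usd: float) -> float:
    """Invert the formula: sustained throughput needed to reach a target CPM."""
    if cpm_usd <= 0:
        raise ValueError("cpm_usd must be positive")
    return usd_per_hour / cpm_usd * 1_000_000 / 3600


# H100 SXM at USD 2.90/hr, 4,200 tok/s (first example above)
print(round(cost_per_million_tokens(2.90, 4200), 2))

# H100 serving a 70B model at FP8: the 1,500-2,500 tok/s range
print(round(cost_per_million_tokens(2.90, 2500), 2),
      round(cost_per_million_tokens(2.90, 1500), 2))

# Throughput a B200 at USD 5.89/hr must sustain to reach USD 0.42 per million
print(round(implied_tokens_per_sec(5.89, 0.42)))
```

The inverse function is useful when a provider quotes a CPM without a throughput figure: it shows what sustained tokens per second the claim depends on.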
The Serving Stack Matters As Much As the Silicon
Three open-source frameworks dominate self-hosted inference:
1. vLLM
- Most widely deployed.
- PagedAttention (efficient KV-cache management) plus continuous batching delivers 2-4x the throughput of static batching on real traffic.
- Runs on CUDA and increasingly on ROCm.
- OpenAI-compatible API.
2. NVIDIA TensorRT-LLM
- NVIDIA's optimized inference engine.
- Deep Hopper-specific kernels (FlashAttention-3 included).
- Most aggressive quantization support.
- Highest tokens-per-second numbers on NVIDIA hardware, at the cost of operational complexity and CUDA lock-in.
3. SGLang
- Newer, optimized for complex prompting patterns and agent workloads.
- Pulls ahead of vLLM in scenarios with shared prefixes or branching prompts.
The choice between them, combined with quantization strategy, can swing cost-per-million-tokens by 10-12x for the same workload on the same hardware.
Quantization multipliers:
- FP8 on Llama 70B halves weight memory, from about 140 GB at FP16 to about 70 GB, which fits inside a single 80 GB H100 - one chip can now serve a model that would otherwise need two.
- INT4 cuts another factor of two.
- Continuous batching, speculative decoding, and right-sizing the GPU to the model are all multiplicative on top.
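A rough way to sanity-check the weight-memory side of quantization is parameter count times bytes per parameter. The sketch below is a simplification: it counts weights only and ignores KV-cache, activations, and the small overhead of quantization scales. The helper names and the optional `reserve_gb` parameter are illustrative assumptions.

```python
# Approximate storage per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_gpu(num_params: float, precision: str,
                gpu_memory_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights fit after setting aside reserve_gb for cache and runtime."""
    return weight_memory_gb(num_params, precision) + reserve_gb <= gpu_memory_gb


params_70b = 70e9
for precision in ("fp16", "fp8", "int4"):
    size = weight_memory_gb(params_70b, precision)
    print(precision, size, "GB, fits on one 80 GB GPU:",
          fits_on_gpu(params_70b, precision, gpu_memory_gb=80))
```

Setting `reserve_gb` to a realistic cache budget shows how little headroom an 80 GB card has left once 70 GB of FP8 weights are loaded.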
The "Open-Source Self-Hosted" Path
The math for self-hosting vs calling an API has shifted decisively in 2026, but not always in the direction technical teams expect.
Budget Tier APIs Are Aggressively Cheap
- Together AI, Fireworks AI, DeepInfra, Hyperbolic, and SambaNova now offer Llama 3.2 3B at USD 0.06 per million input tokens.
- DeepSeek-R1-Distill-Llama-70B: USD 0.70 - 1.05 per million tokens blended.
- Google Gemini Flash-Lite leads the budget tier at USD 0.075 / 0.30 per million input/output.
Self-Hosting Breakeven Points
Different sources put the breakeven at:
- 50 - 100 million tokens per month vs budget APIs
- 5 - 10 million tokens per month vs premium APIs (GPT-5, Claude Sonnet)
- Around 256 million tokens per month for Llama 70B vs GPT-5 pricing
Below those thresholds, engineering overhead - DevOps staff, deployment pipelines, monitoring, model updates every 6-8 weeks - typically erases the GPU cost savings.
Real numbers from one healthcare AI team: USD 4,300 monthly Lambda Labs GPU + USD 6,100 in engineering hours = USD 10,400/month total, against USD 1,870 for the equivalent API spend. They paid 5.6x more for the privilege of self-hosting.
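The healthcare example reduces to a few lines of arithmetic, and the same helper can estimate a breakeven volume. This is a sketch under a strong assumption: the whole self-hosting bill is treated as a fixed monthly cost, with an optional marginal cost per million tokens. The function names are illustrative. Pairing the team's bill with the USD 0.70 API price quoted earlier is only an illustration, not a figure from that team.

```python
import math


def self_host_monthly_usd(gpu_usd: float, engineering_usd: float) -> float:
    """Total monthly self-hosting bill: GPU rental plus engineering time."""
    return gpu_usd + engineering_usd


def breakeven_million_tokens_per_month(fixed_monthly_usd: float,
                                       api_usd_per_million: float,
                                       self_host_marginal_usd_per_million: float = 0.0) -> float:
    """Monthly volume (in millions of tokens) at which self-hosting matches API spend."""
    saving_per_million = api_usd_per_million - self_host_marginal_usd_per_million
    if saving_per_million <= 0:
        return math.inf  # self-hosting never catches up at these prices
    return fixed_monthly_usd / saving_per_million


total = self_host_monthly_usd(gpu_usd=4_300, engineering_usd=6_100)
print(total)                    # monthly self-hosting bill in USD
print(round(total / 1_870, 1))  # multiple of the equivalent API spend

# Illustrative only: volume needed to justify that bill against a USD 0.70 API
print(round(breakeven_million_tokens_per_month(total, 0.70)))
```

Because the fixed cost sits in the numerator, every dollar of engineering overhead raises the breakeven volume in proportion. This is why small teams rarely reach it.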
When Self-Hosting Wins
The breakeven flips decisively above roughly 500 million tokens per day (about 15 billion tokens/month):
- Self-hosting delivers something like 5x cost savings over premium API pricing.
- For organizations at that scale, self-hosting saves USD 5M - 50M per year.
- Engineering overhead becomes rounding error against those savings.
The Hidden Cost: KV-Cache and Concurrency
A subtle but critical point that surprises many teams when they move from prototype to production:
KV-cache memory often dominates VRAM use at production concurrency, and that changes which GPU is actually cost-optimal.
The KV-cache formula per request:
KV per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
For Llama 2 70B (80 layers, 8 KV heads, 128 head_dim) at FP16 with 4K context:
- ~1.3 GB per concurrent request when the full 4K context is cached (about 0.33 MB per token).
- At 200 concurrent requests → roughly 270 GB of cache alone (well above the model weights).
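The formula is easy to evaluate directly. The sketch below plugs in the Llama 2 70B shape quoted above at a full 4,096-token sequence in FP16, where it works out to roughly 1.34 GB per request; shorter average sequences scale that down linearly. The function names are illustrative, and the free-VRAM figures are assumptions: 80 GB and 141 GB cards with about 70 GB set aside for FP8 weights.

```python
def kv_cache_bytes_per_request(num_layers: int, num_kv_heads: int, head_dim: int,
                               seq_len: int, bytes_per_element: int = 2) -> int:
    """KV-cache size for one request; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


def max_concurrent_requests(free_vram_gb: float, per_request_bytes: int) -> int:
    """How many full-length requests fit in the VRAM left after the weights."""
    return int(free_vram_gb * 1e9 // per_request_bytes)


# Llama 2 70B shape from the text: 80 layers, 8 KV heads, head_dim 128, FP16 cache
per_request = kv_cache_bytes_per_request(80, 8, 128, seq_len=4096)
print(per_request / 1e9, "GB per request at 4K context")

# Assumed headroom after ~70 GB of FP8 weights on each card
print("80 GB card: ", max_concurrent_requests(80 - 70, per_request), "requests")
print("141 GB card:", max_concurrent_requests(141 - 70, per_request), "requests")
```

Rerunning with `seq_len=16384` or `seq_len=32768` shows how quickly long contexts consume the remaining headroom.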
This is why H200 at USD 2.60/hr often beats H100 at USD 2.00/hr once concurrency gets past a certain point - the H100 simply runs out of room. When you scale to 16K or 32K contexts, the cache explodes and the equation shifts even further toward memory-rich chips.
This is also the structural reason memory capacity and bandwidth matter at least as much as raw FLOPs for inference, and why AMD's MI300X (192 GB HBM3) and MI355X (288 GB HBM3e) are credible alternatives even with a less mature software stack.
Tools the Cost-Conscious Are Actually Using
A handful of inference-cost-reduction tools and patterns have stabilized into common practice:
Inference Stack
- vLLM with continuous batching as the default serving layer for self-hosted deployments.
- TensorRT-LLM for the absolute highest throughput on NVIDIA.
- AWQ, GPTQ, SmoothQuant for INT4 / INT8 quantization.
- FP8 quantization wherever the hardware supports it.
- Speculative decoding with a smaller draft model paired against the larger target model - often cuts cost per generated token by 2-3x.
- Prompt caching (notable in Anthropic's API pricing) for workloads with large repeated system prompts.
- KV-cache offloading to CPU memory for very long contexts.
Cloud Management
- Spot instances for batch and async workloads (40-60% savings).
- Reserved / committed pricing for predictable baseline traffic (up to 35% off on Nebius, similar on most neo-clouds).
- Aggressive autoscaling tied to actual concurrent request counts rather than peak provisioning.
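The pricing levers above combine into a blended monthly bill. The sketch below assumes three kinds of GPU-hours: reserved baseline, on-demand burst, and spot batch. Its default discounts come from the ranges quoted above (35% reserved, 50% as the midpoint of the 40-60% spot range). The GPU-hour split in the example is hypothetical, and the function name is illustrative.

```python
def monthly_gpu_cost(on_demand_usd_per_hour: float,
                     baseline_gpu_hours: float,
                     burst_gpu_hours: float,
                     batch_gpu_hours: float,
                     reserved_discount: float = 0.35,
                     spot_discount: float = 0.50) -> float:
    """Blended bill: baseline on reserved pricing, bursts on-demand, batch on spot."""
    reserved = baseline_gpu_hours * on_demand_usd_per_hour * (1 - reserved_discount)
    on_demand = burst_gpu_hours * on_demand_usd_per_hour
    spot = batch_gpu_hours * on_demand_usd_per_hour * (1 - spot_discount)
    return reserved + on_demand + spot


# Hypothetical month: one always-on GPU (720 h), 100 h of bursts, 200 h of batch work
rate = 2.10
blended = monthly_gpu_cost(rate, baseline_gpu_hours=720,
                           burst_gpu_hours=100, batch_gpu_hours=200)
all_on_demand = (720 + 100 + 200) * rate
print(round(blended, 2), "USD blended vs", round(all_on_demand, 2), "USD all on-demand")
```

The split between the three buckets matters more than any single discount, so it is worth measuring real traffic before committing to reserved capacity.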
What This Looks Like in Summary
For a few million tokens/day on a small-to-mid model
- Right answer is almost always an API.
- Dollar savings from self-hosting are dwarfed by engineering cost and operational risk.
For tens of millions of tokens/day on a 7B-32B model
- Single H100 SXM at USD 2.10/hr on a specialized neo-cloud, running vLLM with FP8, is a sweet spot.
- Expect CPM around USD 0.20 - 0.40 depending on context length and concurrency.
For 70B+ models or long-context workloads at scale
- H200 at USD 2.50-2.60/hr or B200 at USD 5-6/hr is typically the better cost-per-token choice.
- Counter-intuitive: the more expensive chip is often cheaper per token.
For sovereign or regulated workloads
- Cost-per-token calculation is mostly irrelevant; the constraint is the constraint.
- This is increasingly where AMD MI300X / MI325X and non-GPU alternatives (Tenstorrent, Cerebras, Groq, SambaNova) find natural footholds.
Coming Next
That last category - buyers who actively want a non-NVIDIA, non-AMD path for reasons beyond pure dollar cost - is where Part 3 picks up. We introduce Tenstorrent, the Jim Keller-led company building inference chips on open RISC-V and a fully open software stack.
Sources: GMI Cloud pricing pages, Spheron Blog GPU pricing comparison (March 2026 + April 2026), Awesome Agents GPU pricing comparison, SemiAnalysis AMD vs NVIDIA inference benchmark analysis, MyEngineeringPath LLM token pricing guide, Featherless LLM API pricing comparison, BrainCuber self-hosted cost analysis, AI Pricing Master self-hosting TCO analysis.




