The Inference Era: How NVIDIA and AMD Are Fighting for the Next AI Goldmine

Yashwanth

Introduction

For most of the AI boom so far, the headlines have belonged to one phase of the workload: training. Companies bought clusters of NVIDIA H100s, racked them up by the thousands, and burned megawatts teaching frontier models to autocomplete the world. But somewhere between GPT-4 and the agent-driven applications now shipping inside every product team's roadmap, the center of gravity quietly shifted.

Training is still expensive, but inference - actually running those models in production, every second, against real user traffic - is where the money is now being spent. And it is where the hardware fight is heating up the fastest.

This is Part 1 of a 6-part series on the AI inference hardware landscape. We start with the market shape and the two incumbents - NVIDIA and AMD - and where the cracks in the GPU paradigm are opening doors for a new class of challengers.

Why Inference Became the New Gold Rush

The numbers tell the story:

  • The global AI inference market was valued at roughly USD 106 billion in 2025 and is projected to reach about USD 255 billion by 2030, growing at a CAGR of around 19.2%.
  • Hardware alone - GPUs, accelerators, and purpose-built inference chips - dominates that figure as cloud providers and enterprises race to deploy capacity.
  • North America accounts for nearly half the global market today, but Asia Pacific is growing at close to 20% CAGR thanks to expanding digital infrastructure across India, Korea, Japan, and Southeast Asia.
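As a quick sanity check on the market figures above, the growth rate can be recomputed from the two endpoints. This is a minimal sketch; the function names are illustrative, and the only inputs are the USD 106 billion (2025) and USD 255 billion (2030) values quoted in the list.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.19 means 19%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value: float, rate: float, years: float) -> float:
    """Value after compounding `rate` for `years` years."""
    return start_value * (1.0 + rate) ** years


if __name__ == "__main__":
    # Figures quoted above: USD 106B in 2025, USD 255B in 2030 (5 years apart).
    implied = cagr(106.0, 255.0, 5)
    print(f"Implied CAGR 2025-2030: {implied:.1%}")
    print(f"106B compounded at 19.2% for 5 years: {project(106.0, 0.192, 5):.0f}B")
```

The two endpoints and the quoted growth rate are consistent with each other, which suggests they come from the same forecast rather than being mixed from different reports.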

What drove this shift was not any single launch. It was the accumulating reality that every generative AI feature, every chatbot, every retrieval-augmented application, every coding assistant, and every image and video generation pipeline is paying an inference bill.

Unlike training, which is a one-time capital cost amortized across the lifetime of the model, inference is a per-request operating expense. Once you have ten million daily active users and each one consumes thousands of tokens, the bill compounds.
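To make the compounding concrete, here is a rough back-of-the-envelope sketch. The per-user token count (5,000) and the blended price of USD 1.00 per million tokens are assumptions chosen purely for illustration; they are not figures from any provider.

```python
def daily_inference_cost(daily_active_users: int,
                         tokens_per_user_per_day: int,
                         usd_per_million_tokens: float) -> float:
    """Daily serving bill in USD for a given user base and token price."""
    total_tokens = daily_active_users * tokens_per_user_per_day
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Assumptions (hypothetical): 5,000 tokens per user per day,
    # USD 1.00 per million tokens blended across input and output.
    per_day = daily_inference_cost(10_000_000, 5_000, 1.00)
    print(f"Daily bill:  USD {per_day:,.0f}")
    print(f"Yearly bill: USD {per_day * 365:,.0f}")
```

Unlike a training run, this number scales linearly with both users and tokens per user, so it grows with every new feature that calls the model.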

NVIDIA itself is expected to report roughly USD 49 billion in AI-related revenue in 2025 - a 39% jump year over year - and a growing share of that is inference workloads.

NVIDIA: The Incumbent That Built the Road Everyone Drives On

NVIDIA's position in inference is, in a word, dominant. The chips have stayed ahead, and the software around them has compounded for nearly two decades.

1. The H100 - The Production Workhorse

  • Launched on the Hopper architecture; became the default chip for serving LLMs in 2023-2024.
  • Introduced an FP8 Transformer Engine that effectively doubled throughput over FP16 with minimal accuracy loss.
  • Backed by the most mature software stack in the industry: TensorRT-LLM, Triton Inference Server, CUDA.
  • Every major inference framework - vLLM, SGLang, TGI - has its best optimizations on CUDA first.

2. The H200 - The Memory Upgrade

  • Same ~989 FP16 TFLOPS as H100, but ships with 141 GB of HBM3e memory and 4.8 TB/s of bandwidth (versus 80 GB of HBM3 and 3.35 TB/s on the H100 SXM).
  • Delivers up to 1.9x inference speedup on Llama 2 70B vs H100 in NVIDIA's official benchmarks.
  • For long-context inference where the KV-cache balloons into tens of GBs per concurrent request, H200 became the practical workhorse.
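The KV-cache growth mentioned above follows from simple arithmetic: every token of context stores one key and one value vector per layer. The sketch below uses the published Llama 2 70B shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and assumes FP16 cache entries. The context lengths are illustrative only; Llama 2's native window is 4,096 tokens.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache stored per token of context."""
    # Leading 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB for one request at the given context length."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 1024**3


if __name__ == "__main__":
    # Llama 2 70B shape: 80 layers, 8 KV heads, head dimension 128, FP16 cache.
    for tokens in (4_096, 32_768, 131_072):
        size = kv_cache_gib(tokens, 80, 8, 128)
        print(f"{tokens:>7} tokens -> {size:.2f} GiB per request")
```

At 32,768 tokens this works out to 10 GiB per request, so a handful of concurrent long-context requests can consume more memory than the cache budget left over on an 80 GB card once weights are loaded.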

3. Blackwell (B200 and GB200 NVL72) - The Big Leap

  • Uses FP4 precision on fifth-generation Tensor Cores, roughly doubling peak compute relative to FP8.
  • MLPerf Inference v4.1 on Llama 2 70B: a single B200 reached ~10,755 tokens/sec in server mode and 11,264 tokens/sec offline.
  • A single B200 is approximately 3.7-4x faster than a single H100.
  • Across longer-context workloads, B200 is also the cost efficiency leader in independent benchmarks - the throughput advantage more than compensates for the higher hourly rate.

4. The Rubin Roadmap

  • Announced at CES 2026.
  • Extends the lineage with unified compute, memory, and networking targeted at long-context reasoning workloads.

Where NVIDIA's Pitch Starts to Fray

For all that, NVIDIA's inference dominance comes with costs that are increasingly hard to ignore, and those costs create openings that competitors are working to exploit.

1. CUDA Lock-In

  • CUDA is a proprietary platform tied exclusively to NVIDIA hardware.
  • 18+ years of compiler optimizations, hand-tuned libraries (cuDNN, cuBLAS, TensorRT), and millions of trained developers create a moat - but also a one-vendor dependency.
  • Migrating off NVIDIA once your code is full of CUDA kernels and TensorRT-LLM tricks is not a weekend project.

2. Price and Power

  • On-demand H100 SXM rates from neo-cloud providers sit around USD 2.00-3.50 per GPU-hour.
  • Hyperscale clouds (AWS, GCP, Azure) charge USD 4.00-8.00 for the same silicon.
  • B200 on-demand pricing typically runs USD 5.00-6.00 per hour.
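Hourly rates only become comparable once they are divided by throughput. The sketch below converts the rental prices above into a cost per million tokens. It uses the B200 MLPerf offline figure quoted earlier (11,264 tokens/sec), and it assumes an H100 throughput back-derived from the "3.7-4x" ratio, which is an approximation rather than a measured number. MLPerf offline throughput is a best-case batched figure, so real serving costs will likely be higher on both chips.

```python
def usd_per_million_tokens(usd_per_gpu_hour: float,
                           tokens_per_second: float) -> float:
    """Serving cost per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    B200_TOKENS_PER_S = 11_264  # MLPerf offline figure quoted earlier
    # Assumption: H100 throughput derived from the quoted 3.7x ratio.
    H100_TOKENS_PER_S = B200_TOKENS_PER_S / 3.7

    scenarios = [
        ("H100 neo-cloud, low", 2.00, H100_TOKENS_PER_S),
        ("H100 neo-cloud, high", 3.50, H100_TOKENS_PER_S),
        ("B200 on-demand, low", 5.00, B200_TOKENS_PER_S),
        ("B200 on-demand, high", 6.00, B200_TOKENS_PER_S),
    ]
    for label, rate, tps in scenarios:
        cost = usd_per_million_tokens(rate, tps)
        print(f"{label:<22} USD {cost:.3f} per 1M tokens")
```

Under these assumptions, even the most expensive B200 rate comes out cheaper per token than the cheapest H100 rate, which is the arithmetic behind the cost-efficiency claim.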

3. Architectural Overfit

  • H100's strength is SIMT execution at massive thread parallelism with latency-hiding via warp scheduling.
  • Works extraordinarily well for dense, batched matrix multiplies in transformer attention.
  • Works less well for sparse, dynamic, MoE-heavy models that increasingly characterize the frontier.
  • According to independent analysis from SemiAnalysis, the GB200's FLOPS-per-Watt improvement over H100 is closer to 47%, not the marketed 30x.

4. Supply Concentration

  • NVIDIA's grip gives it pricing power and creates geopolitical risk for buyers.
  • Sovereign AI procurement programs in Europe, Middle East, and Asia are increasingly explicit about wanting non-NVIDIA options.

AMD: The Credible Second Source That Just Got More Credible

AMD has spent the last three years climbing the inference hill. The journey has gone through three serious generations.

1. MI300X (CDNA 3, 2024)

  • 192 GB HBM3 memory, 5.3 TB/s bandwidth.
  • First AMD accelerator taken seriously for production inference, particularly memory-bound workloads.
  • Independent benchmarks: beats H100 in absolute performance and per-dollar performance on Llama 3 405B and DeepSeek V3 671B inference.
  • Catch: short-term rental markets favor NVIDIA because far more neo-clouds offer H100s than MI300Xs.
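One reason memory capacity matters for very large models is simply whether the weights fit on a single node. The sketch below estimates the minimum number of GPUs needed to hold the weights alone. The 90% usable-memory fraction is an assumption, and the estimate ignores KV-cache, activations, and parallelism overheads, so real deployments need more headroom.

```python
import math


def min_gpus_for_weights(params_billion: float, bytes_per_param: float,
                         gpu_memory_gb: float,
                         usable_fraction: float = 0.9) -> int:
    """Minimum GPU count whose combined usable memory holds the weights."""
    # 1e9 parameters times bytes per parameter, expressed in GB (1e9 bytes).
    weights_gb = params_billion * bytes_per_param
    usable_gb = gpu_memory_gb * usable_fraction  # assumed headroom factor
    return math.ceil(weights_gb / usable_gb)


if __name__ == "__main__":
    # Llama 3 405B at FP16 (2 bytes) and FP8 (1 byte) per parameter.
    for precision, nbytes in (("FP16", 2), ("FP8", 1)):
        mi300x = min_gpus_for_weights(405, nbytes, 192)
        h100 = min_gpus_for_weights(405, nbytes, 80)
        print(f"{precision}: MI300X x{mi300x}, H100 x{h100}")
```

With these assumptions, FP16 weights for a 405B model fit inside one 8-GPU MI300X node but need more than eight H100s, which forces either lower precision or a multi-node deployment on the 80 GB part.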

2. MI325X (Q4 2024)

  • Upgrade to 256 GB HBM3e, 6 TB/s bandwidth.
  • Positioned as a direct response to H200.
  • Timing was unfortunate: by the time it shipped in Q2 2025, NVIDIA's B200 was already ramping, and many customers skipped the upgrade.

3. MI350 Series - MI350X and MI355X (CDNA 4, 2025)

  • Built on TSMC 3nm.
  • Up to 288 GB HBM3e, 8 TB/s memory bandwidth.
  • Native FP4 and FP6 support.
  • AMD claims the MI355X beats NVIDIA's competing parts by up to 1.3x in like-for-like inference benchmarks and leads by 1.13x in select training workloads.
  • The "35x inference improvement over MI300X" claim rests on a cherry-picked FP4-vs-FP16 comparison, so read it with skepticism.
  • Power consumption tops out at 1,400W TBP for the liquid-cooled MI355X (vs 750W for MI300X).
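A simple bandwidth-bound estimate helps put the generational claims in context. During single-stream decoding of a dense model, each generated token requires reading roughly all the weights once, so memory bandwidth divided by weight size gives a loose upper bound on tokens per second. The sketch below applies that to the bandwidth figures quoted for the three AMD generations. The 70B model size is a hypothetical example, and the estimate ignores KV-cache reads, compute limits, and batching.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, params_billion: float,
                                bytes_per_param: float) -> float:
    """Loose upper bound on single-stream decode speed for a dense model."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


if __name__ == "__main__":
    PARAMS_B = 70  # hypothetical dense model size, for illustration only
    chips = [("MI300X", 5.3), ("MI325X", 6.0), ("MI355X", 8.0)]
    for name, bandwidth in chips:
        fp16 = decode_ceiling_tokens_per_s(bandwidth, PARAMS_B, 2.0)
        print(f"{name} FP16: {fp16:6.1f} tokens/s ceiling")
    # Same MI355X bandwidth, but FP4 weights (0.5 bytes per parameter).
    fp4 = decode_ceiling_tokens_per_s(8.0, PARAMS_B, 0.5)
    print(f"MI355X FP4:  {fp4:6.1f} tokens/s ceiling")
```

Under this model the bandwidth gain from MI300X to MI355X is about 1.5x, while switching the same chip from FP16 to FP4 weights adds another 4x. That is why comparisons that change precision and hardware at the same time produce headline multiples much larger than the silicon alone delivers.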

4. Helios (CES 2026)

  • AMD's rack-scale AI system, targeting both training and inference.
  • High-bandwidth memory and system-level optimization.
  • MI400 series on the roadmap; AMD's internal numbers show a 2.3x training speed improvement over MI300 for image classification.

Where AMD Still Trails

The fundamental gap is software, not silicon.

  • ROCm 7 landed September 2025 with native Windows support and day-zero PyTorch integration.
  • For workloads built on PyTorch + vLLM or SGLang with no custom kernels, ROCm parity is now close to CUDA.
  • But "close to" is not "equal to":
    • CUDA still dominates: TensorRT-LLM, FlashAttention-3 (Hopper-specific), NVIDIA NIM containers, any pipeline with CUDA-specific custom kernels.
    • ROCm is competitive: memory-bandwidth-heavy inference, PyTorch + vLLM/SGLang with no custom kernels.
  • Rental market frictions: Far fewer neo-clouds offer MI300X / MI325X than H100s, which artificially inflates short-term rental rates for AMD hardware.
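Part of why plain PyTorch workloads port easily is that ROCm builds of PyTorch expose the same torch.cuda namespace, so device-selection code usually runs unchanged on AMD hardware. The sketch below shows one way to tell the two backends apart at runtime. The function name is hypothetical, and it falls back gracefully when PyTorch is not installed.

```python
def detect_gpu_backend() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'no-torch' for the current process."""
    try:
        import torch
    except ImportError:
        return "no-torch"

    # On both CUDA and ROCm builds, GPU availability is reported here.
    if not torch.cuda.is_available():
        return "cpu"

    # torch.version.hip is set on ROCm builds and is None on CUDA builds.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


if __name__ == "__main__":
    backend = detect_gpu_backend()
    print(f"Detected backend: {backend}")
    if backend in ("rocm", "cuda"):
        import torch
        # The same device string works on either vendor's hardware.
        x = torch.ones(4, device="cuda")
        print(x.sum().item())
```

Code that stays at this level of abstraction tends to move between vendors with little effort. The friction described above appears once a pipeline depends on hand-written CUDA kernels or NVIDIA-only libraries.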

The Shape of the Fight

The headline story of 2025 and 2026 has been NVIDIA's lead narrowing, not collapsing:

  • AMD has caught up to within striking distance on raw silicon.
  • Open-source software is now production-viable for the most common inference patterns.
  • NVIDIA still wins on ecosystem maturity, breadth of optimization, and predictable performance, especially for cutting-edge inference engines that target Hopper-specific features.

But what neither incumbent has solved is the architectural question:

Are GPUs - designed originally for graphics, retrofitted for HPC, then retrofitted again for AI - actually the right shape of silicon for inference at scale?

Both NVIDIA and AMD answer yes, and back that answer with massive R&D budgets and impressive generational jumps. A small but growing set of challengers - Cerebras, Groq, SambaNova, and most relevantly for this series, Tenstorrent - answer no, and are betting their companies on different architectural foundations.

Coming Next

In Part 2, we turn from the chips themselves to the dollars: what does inference actually cost on NVIDIA versus AMD versus open-source self-hosted stacks today, and what tools are people using to drive that number down?

Sources include: MarketsandMarkets, Polaris Market Research, SemiAnalysis, NVIDIA MLPerf 4.1 submissions, Tom's Hardware coverage of AMD Advancing AI 2025, Spheron and GMI Cloud pricing pages, SDxCentral analysis on ROCm, Voltage Park and TRG Datacenters technical comparisons.

GPU NET

