NVIDIA vs AMD vs Tenstorrent: An Architectural Deep Dive on Inference

This piece lines up NVIDIA, AMD, and Tenstorrent side by side at the architectural level - execution model, memory hierarchy, interconnect, software stack, and the strategic shape of each company's bet - and explains why Tenstorrent's choices, while less proven, are particularly well fitted to the direction inference workloads are heading. The question is not 'which is best today' but 'which architectural lineage is best matched to where inference is going, and why does that matter for buyers and operators?' This is Part 5 of the AI Inference Hardware series.

Yashwanth

June 13, 2026

Introduction

This piece is the longest in the series because it tries to do the most: line up NVIDIA, AMD, and Tenstorrent side by side at the architectural level - execution model, memory hierarchy, interconnect, software stack, and strategic shape - and explain why Tenstorrent's choices, while less proven, are particularly well-fitted to where inference is heading.

As of mid-2026, NVIDIA still wins most inference workloads on raw deployability, ecosystem maturity, and predictable performance. The question this article is asking is not "which is best today" but "which architectural lineage is best matched to where inference is going, and why does that matter for buyers and operators?"

Execution Model: SIMT, SIMD-Heavy MIMD, or Mesh MIMD

The single most consequential architectural difference between the three companies is how they organize parallel execution.

1. NVIDIA Hopper and Blackwell - SIMT

Single Instruction, Multiple Thread execution.
A streaming multiprocessor schedules warps of 32 threads that execute in lock-step.
Hardware manages divergent branches via per-thread masks.
Warp scheduling hides memory latency by switching active warps when one stalls on DRAM.
H100: 132 SMs, fourth-generation Tensor Cores, FP8 Transformer Engine, ~256 KB shared memory / L1 cache per SM.
B200: Two compute dies in one package, plus fifth-generation Tensor Cores with native FP4 support.

Strengths:

Extraordinarily well-suited to dense, batched matrix multiplication on regular tensors.
Latency-hiding through warp scheduling means programmers don't have to think carefully about every memory access.
CUDA toolchain abstracts the parallelism so kernels can run across GPU generations.

Weaknesses:

Wasteful for irregular workloads. When threads in a warp diverge (e.g., MoE expert routing), some lanes idle while others compute.
Latency hiding works less well for irregular memory access patterns.
The abstraction tax means programmers cannot easily get below CUDA to optimize specific data movement patterns.
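
The divergence cost described above can be made concrete with a toy model. The sketch below is an illustration, not a description of NVIDIA's scheduler: it assumes a warp runs each distinct branch path one after another with the other lanes masked off, and that every path costs the same number of cycles. The function name and the sample branch patterns are hypothetical.

```python
from collections import Counter

WARP_SIZE = 32  # NVIDIA warp width, as stated in the text


def simt_lane_utilization(branch_per_thread):
    """Fraction of lane-slots doing useful work in one warp.

    Toy assumption: the warp executes one pass per distinct branch path,
    and every pass occupies all lanes for the same number of cycles.
    """
    if not branch_per_thread:
        raise ValueError("need at least one thread")
    paths = Counter(branch_per_thread)
    lanes = len(branch_per_thread)
    total_slots = lanes * len(paths)      # one full-width pass per path
    useful_slots = sum(paths.values())    # each thread is active in one pass
    return useful_slots / total_slots


uniform = [0] * WARP_SIZE                        # all threads take the same path
two_way = [i % 2 for i in range(WARP_SIZE)]      # an if/else split
eight_way = [i % 8 for i in range(WARP_SIZE)]    # e.g. eight different targets

for name, pattern in [("uniform", uniform), ("two-way", two_way), ("eight-way", eight_way)]:
    print(name, simt_lane_utilization(pattern))
```

Under these assumptions utilization is simply one divided by the number of distinct paths, which is why regular tensor code sits near 100% and heavily branching code does not.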

2. AMD CDNA 3 and CDNA 4 - SIMT (wider wavefronts)

Same SIMT model with 64-thread wavefronts, AMD's term for warps (vs NVIDIA's 32-thread warps).
MI300X: 304 compute units across multiple chiplets.
MI355X: Fully featured FP4 and FP6 support in CDNA 4.

Chiplet design:

3D packaging: Accelerator Complex Dies (XCDs) stacked on I/O Dies (IODs) using 3D stacking.
2.5D packaging: IODs connected to each other and to eight HBM stacks (HBM3 on MI300X, HBM3E on MI325X and MI355X).

Memory bandwidth and capacity are the standout figures:

MI300X: 192 GB HBM3 @ 5.3 TB/s
MI325X: 256 GB HBM3E @ 6 TB/s
MI355X: 288 GB HBM3E @ 8 TB/s

Strengths: Very similar to NVIDIA - outstanding throughput on dense matrix workloads, mature kernel libraries.

Weaknesses: Software ecosystem (ROCm) is meaningfully behind CUDA in tooling depth, kernel optimization breadth, and the precise tuning that delivers the last 20% of advertised performance.

3. Tenstorrent Wormhole and Blackhole - Mesh MIMD

A clean architectural departure:

No SIMT. Each Tensix core is a MIMD compute tile with its own instruction stream.
Inside a Tensix core: five small RISC-V "baby" cores coordinate matrix and vector engines plus pack/unpack units, with 1.5 MB of local SRAM.
Across a chip: Tensix cores arranged in a 2D mesh connected by a network-on-chip.
Across chips: mesh extends via 400 Gbps Ethernet (Wormhole) or 800G QSFP-DD (Blackhole) - no PCIe switches, no proprietary fabric.

The radical part:

No cache hierarchy. Data lives in DRAM, in another core's SRAM, or in this core's SRAM - software moves it explicitly via DMA.
Memory access is predictable - with no caches there is no hit-or-miss variance, so software can budget the latency of each DRAM read.
No hardware multithreading. Tensix cores operate via cooperative processing, with explicit software pipelining replacing warp-based latency hiding.

Why this matters for inference:

Transformer inference has a very specific data flow pattern: load weights once, stream activations through layer by layer, write KV-cache back.
Data movement pattern is highly structured.
SIMT's flexibility is wasted; deterministic explicit data movement is exactly the right primitive.
Mesh topology means scaling out doesn't require an expensive switch hierarchy.
Absence of caches means the compiler can schedule data placement perfectly - no eviction surprises.

Trade-off: Programmability. Writing a Tenstorrent kernel in TT-Metalium is genuinely harder than writing a CUDA kernel - you have to think about which SRAM lives where, when data flows across the NoC, and how to overlap compute with movement.
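
To show what "overlap compute with movement" buys, here is a minimal timing sketch of double buffering in plain Python. It is not TT-Metalium code: the model assumes one DMA engine, one compute engine, two SRAM buffers, and fixed per-tile load and compute times, all of which are illustrative assumptions.

```python
def serial_time(n_tiles, load, compute):
    """No overlap: every tile is loaded, then computed, one after another."""
    return n_tiles * (load + compute)


def double_buffered_time(n_tiles, load, compute):
    """Two buffers: tile i+1 is loaded while tile i is being computed.

    Assumed resources (hypothetical): one DMA engine, one compute engine,
    and two SRAM buffers used alternately.
    """
    if n_tiles <= 0:
        return 0
    load_done = [0] * n_tiles
    compute_done = [0] * n_tiles
    for i in range(n_tiles):
        dma_free = load_done[i - 1] if i >= 1 else 0
        # Buffer i % 2 is reusable once tile i-2 has been consumed.
        buffer_free = compute_done[i - 2] if i >= 2 else 0
        load_done[i] = max(dma_free, buffer_free) + load
        engine_free = compute_done[i - 1] if i >= 1 else 0
        compute_done[i] = max(load_done[i], engine_free) + compute
    return compute_done[-1]


print(serial_time(4, 2, 3), double_buffered_time(4, 2, 3))
```

In steady state the pipeline runs at the slower of the two stages, so total time approaches `load + (n_tiles - 1) * max(load, compute) + compute`. On a cache-free design this schedule is the programmer's or compiler's job rather than the hardware's.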

Memory Hierarchy: Where the Inference Money Is Made or Lost

Inference is bottlenecked far more often by memory than by compute. The KV-cache for long-context inference can be larger than the model weights themselves.
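
A quick back-of-the-envelope calculation shows how the KV-cache can overtake the weights. The sketch uses the standard sizing formula (keys plus values, per layer, per KV head, per token); the model shape below is a hypothetical 7B-class dense model with full multi-head attention and 16-bit values, chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def weight_bytes(n_params, bytes_per_param=2):
    """Bytes needed to hold the model weights."""
    return n_params * bytes_per_param


# Hypothetical model shape (assumption, not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM, PARAMS = 32, 32, 128, 7_000_000_000

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)
long_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=32_768)

print(f"KV per token:       {per_token / 2**20:.2f} MiB")
print(f"KV at 32k context:  {long_context / 2**30:.1f} GiB")
print(f"Weights (16-bit):   {weight_bytes(PARAMS) / 2**30:.1f} GiB")
```

With these assumed numbers a single 32k-token sequence needs more memory for its KV-cache than the model needs for its weights, and the gap widens linearly with batch size. Grouped-query attention shrinks `n_kv_heads` and therefore the cache, which is one reason it is widely used.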

NVIDIA

H100 SXM: 80 GB HBM3 @ 3.35 TB/s
H200: 141 GB HBM3e @ 4.8 TB/s
B200: up to 192 GB HBM3e at higher bandwidth
Bandwidth gains drive performance: H200's 1.83-2.14x inference speedup over H100 on long-context Llama-class workloads comes from HBM3e.
H100 drops 64% of throughput as context scales; H200 holds up better (47% drop).

AMD

AMD has positioned memory as its primary lever:

MI300X: 192 GB HBM3 @ 5.3 TB/s
MI325X: 256 GB HBM3E @ 6 TB/s
MI350 series: 288 GB HBM3E @ 8 TB/s

These are the largest single-chip memory pools in production AI hardware today.
This is why MI300X became a de facto choice for memory-bound inference despite the ROCm gap.
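
Why bandwidth matters so much for decode can be sketched with a simple ceiling: if every generated token requires streaming a fixed number of bytes from memory, throughput cannot exceed bandwidth divided by that byte count. The bandwidth figures below are the ones quoted in this section; the 14 GB per token (a 7B-parameter model in 16-bit weights, ignoring KV-cache reads and batching) is a hypothetical placeholder.

```python
TB = 1e12  # bytes

# Peak memory bandwidth in TB/s, as quoted in the text above.
HBM_BANDWIDTH_TBPS = {
    "H100 SXM": 3.35,
    "H200": 4.8,
    "MI300X": 5.3,
    "MI325X": 6.0,
    "MI355X": 8.0,
}


def decode_tokens_per_s_ceiling(bandwidth_tbps, bytes_per_token):
    """Upper bound on single-stream decode rate when memory-bound."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bandwidth_tbps * TB / bytes_per_token


BYTES_PER_TOKEN = 14e9  # hypothetical: 7B parameters x 2 bytes

for chip, bw in HBM_BANDWIDTH_TBPS.items():
    print(f"{chip:9s} <= {decode_tokens_per_s_ceiling(bw, BYTES_PER_TOKEN):6.1f} tokens/s")

h200_vs_h100 = HBM_BANDWIDTH_TBPS["H200"] / HBM_BANDWIDTH_TBPS["H100 SXM"]
print(f"H200 / H100 bandwidth ratio: {h200_vs_h100:.2f}x")
```

The bandwidth ratio alone is about 1.43x, so the larger H200 speedups quoted above would also have to reflect the extra capacity (for example, room for larger batches), which this sketch does not model.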

Tenstorrent

Wormhole: 12 GB GDDR6 per chip @ 288 GB/s.
Blackhole: 28-32 GB GDDR6 per chip @ roughly 0.5 TB/s (the 16 TB/s Galaxy aggregate cited below, divided across 32 chips).
Dramatically behind both NVIDIA and AMD on per-chip memory capacity and bandwidth.

Strategic answer - distributed on-chip SRAM and mesh-scaled DRAM pooling:

Each Blackhole chip carries 180-210 MB of SRAM distributed across the Tensix mesh.
Across a Galaxy of 32 Blackhole chips, that aggregates to 6.2 GB of on-chip SRAM at 2.9 PB/s - orders of magnitude faster than any DRAM.
Total DRAM in a Galaxy: 1 TB at 16 TB/s aggregate, accessed through the on-chip Ethernet mesh and treated logically as a unified memory pool.
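
The Galaxy aggregates above can be sanity-checked by dividing them back down to a single chip. The sketch below only rearranges the figures quoted in this section (using decimal units); nothing in it is measured.

```python
CHIPS_PER_GALAXY = 32

# Aggregate Galaxy figures as quoted in the text.
SRAM_TOTAL_GB = 6.2
SRAM_BW_PB_S = 2.9
DRAM_TOTAL_TB = 1.0
DRAM_BW_TB_S = 16.0


def per_chip(total, chips=CHIPS_PER_GALAXY):
    """Even split of an aggregate figure across the chips in a Galaxy."""
    return total / chips


sram_mb_per_chip = per_chip(SRAM_TOTAL_GB * 1000)      # GB -> MB
dram_gb_per_chip = per_chip(DRAM_TOTAL_TB * 1000)      # TB -> GB
dram_gbs_per_chip = per_chip(DRAM_BW_TB_S * 1000)      # TB/s -> GB/s
sram_vs_dram_bw = (SRAM_BW_PB_S * 1000) / DRAM_BW_TB_S  # both in TB/s

print(f"SRAM per chip:           {sram_mb_per_chip:.1f} MB")
print(f"DRAM per chip:           {dram_gb_per_chip:.1f} GB")
print(f"DRAM bandwidth per chip: {dram_gbs_per_chip:.0f} GB/s")
print(f"SRAM vs DRAM bandwidth:  {sram_vs_dram_bw:.0f}x")
```

The implied per-chip SRAM falls inside the 180-210 MB range given above, and the aggregate SRAM bandwidth works out to roughly two orders of magnitude above the aggregate DRAM bandwidth.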

The cost angle:

GDDR6 is much cheaper than HBM3/HBM3e - hundreds of dollars per chip difference.
Tenstorrent's bet: aggressive use of distributed SRAM plus mesh DRAM pooling delivers competitive or better effective bandwidth at substantially lower cost per chip.
Savings compound at rack and supercluster scale.

Interconnect: The Part Most People Underestimate

Interconnect is where data center economics live or die.

NVIDIA

NVLink 4.0 at 900 GB/s bidirectional aggregate per GPU within HGX/DGX.
NVSwitch for all-to-all communication.
InfiniBand or RoCE for multi-node networking.
Fast, expensive, proprietary.
NVSwitch hardware and associated licensing add meaningful cost.
GB200 NVL72: 72 Blackwell GPUs in one rack tied together by fifth-gen NVLink at 1.8 TB/s per GPU.

AMD

Infinity Fabric serves a similar role inside 8-way MI300X servers.
Faster generationally with MI355X but still proprietary.

Tenstorrent

Integrated 400 Gbps Ethernet directly onto each Wormhole chip.
800 Gbps Ethernet onto each Blackhole chip.
Scaling out = passive QSFP-DD cable from one chip to another. No NVSwitch, no InfiniBand switches, no proprietary fabric.
On-chip 2D mesh and off-chip Ethernet form one continuous logical network.
Exposed to software as one big mesh of Tensix cores.
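
One way to picture "one big mesh" is as a single coordinate space that spans chips. The sketch below is a simplified model, not Tenstorrent's actual addressing or routing: it assumes a plain 2D mesh without wraparound links, dimension-ordered (X then Y) routing, and chip-to-chip hops that look the same as on-chip hops. The grid size and all function names are hypothetical.

```python
def global_coord(chip_xy, local_xy, cores_per_chip):
    """Map (chip position, core position within chip) to one flat mesh coordinate."""
    width, height = cores_per_chip
    return (chip_xy[0] * width + local_xy[0], chip_xy[1] * height + local_xy[1])


def xy_route(src, dst):
    """Coordinates visited when moving along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] >= x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    step = 1 if dst[1] >= y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


def hops(src, dst):
    """Number of links crossed; equals the Manhattan distance on a plain mesh."""
    return len(xy_route(src, dst)) - 1


CORES_PER_CHIP = (8, 8)  # hypothetical grid, for illustration only
a = global_coord((0, 0), (7, 2), CORES_PER_CHIP)  # last column of chip (0, 0)
b = global_coord((1, 0), (0, 2), CORES_PER_CHIP)  # first column of the chip to its right
print(a, b, hops(a, b))
```

In this model two cores on neighbouring chips can be a single hop apart, which is the practical meaning of treating the on-chip NoC and the off-chip Ethernet as one network. Real link latencies differ between the two, which the sketch deliberately ignores.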

Specific consequence at scale:

Tenstorrent Galaxy: 32 Blackhole chips on a single board with 56 x 800G Ethernet ports (11.2 TB/s aggregate, counting both directions).
4 Galaxies form a "quad."
Quads connect into superclusters via cabled all-to-all topology - idle quads repurposable as switches.
Tenstorrent on record: "Critically, there are no Ethernet switches anywhere in the design."
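
The 11.2 TB/s figure follows directly from the port count, provided both directions of each link are counted. A minimal conversion sketch (the helper name is hypothetical):

```python
def aggregate_tb_per_s(ports, gbit_per_port, bidirectional=True):
    """Convert a port count and per-port line rate (Gbit/s) to aggregate TB/s."""
    gbytes_one_way = ports * gbit_per_port / 8   # bits -> bytes
    directions = 2 if bidirectional else 1
    return gbytes_one_way * directions / 1000    # GB/s -> TB/s


one_way = aggregate_tb_per_s(56, 800, bidirectional=False)
both_ways = aggregate_tb_per_s(56, 800, bidirectional=True)
print(f"one direction:   {one_way:.1f} TB/s")
print(f"both directions: {both_ways:.1f} TB/s")
```

Each 800G port carries 100 GB/s per direction, so 56 ports give 5.6 TB/s one way and 11.2 TB/s when both directions are summed.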

Tenstorrent's TT-Fabric specification claims a 10x TCO advantage for AI data center design from this approach. That number comes from Tenstorrent's own modeling and should be read accordingly, but the structural argument (no switch tax, no proprietary fabric tax) holds regardless.

Software Stack

1. NVIDIA

CUDA at the bottom
cuBLAS for linear algebra, cuDNN for deep learning primitives
TensorRT-LLM for optimized inference
Triton Inference Server for serving
NIM containers for one-click deployment of pre-optimized models
Rich third-party ecosystem (vLLM, SGLang, TGI, Ollama) targets CUDA first

Depth of optimization at every layer is unmatched. Cost: total vendor lock-in.

2. AMD

ROCm
HIP (CUDA-like C++ extension) at lower level, with translation layers
rocBLAS/hipBLAS for linear algebra, MIOpen for deep learning primitives
Compatibility with PyTorch, vLLM, SGLang is solid
Meta production partnership has driven significant maturity
Open-source in name and increasingly in practice

3. Tenstorrent

Fully open-source Apache 2.0 across every layer:

TT-Metalium - Bare-metal SDK, OpenCL-like C++ interface, direct access to RISC-V cores, NoC, matrix and vector engines.
TT-NN - Operator library with a PyTorch-like Python API.
TT-Forge - MLIR-based compiler bridging PyTorch, JAX, ONNX to hardware. Claimed support for 800+ model variants in CI.
TT-LLK - Low-level kernel layer.

Maturity ranking: NVIDIA gold standard → AMD production-viable for common patterns → Tenstorrent rapidly improving but earliest in out-of-the-box experience.

Strategic Shape: Where Each Company Is Betting

NVIDIA

Betting that the GPU paradigm has decades more headroom, that CUDA's compounding ecosystem advantage is unassailable, and that vertical integration (chips + DGX systems + InfiniBand + NIM + software) will keep enterprise customers locked in.

Rubin roadmap (CES 2026) doubles down on unified compute, memory, and networking for long-context reasoning workloads.

AMD

Betting that being a credible second source - better memory, cheaper hardware, increasingly open software - is enough to take meaningful share from NVIDIA, especially in memory-bound inference.

MI400 roadmap, Helios rack-scale platform, and the Meta partnership all point in this direction.

Tenstorrent

Betting on architectural divergence. Not "better GPUs" but "the post-GPU architecture for AI": mesh-based MIMD, open RISC-V, open software, no switch hierarchy, distributed SRAM, scale-out by Ethernet.

The bet: inference workloads are becoming more diverse, more data-movement-dominated, and more sensitive to total cost of ownership - and a clean-sheet architecture purpose-built for those properties beats incremental improvements on a paradigm originally designed for graphics.

Why This Matters for Inference Specifically

The argument for Tenstorrent on inference is concrete:

1. Mixture-of-Experts Models

Route different tokens to different experts.
On a SIMT GPU, this causes warp divergence and underutilization.
On a MIMD mesh, each core can independently run its assigned expert without lock-step coordination.
DeepSeek V4-class models map naturally onto Tenstorrent's architecture.
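
The underutilization argument can be sketched with a toy router. The code below is a simplification: uniform random top-k routing stands in for a learned gate, and "lock-step" is modeled as every expert waiting for the busiest one (as if batches were padded to the largest load), which is an assumption rather than a description of any specific GPU kernel. Function names and sizes are hypothetical.

```python
import random
from collections import Counter


def route_tokens(n_tokens, n_experts, top_k, seed=0):
    """Assign each token to top_k distinct experts; return tokens per expert."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(n_tokens):
        for expert in rng.sample(range(n_experts), top_k):
            load[expert] += 1
    return [load[e] for e in range(n_experts)]


def lockstep_efficiency(load):
    """Useful fraction of work if every expert is padded to the busiest one."""
    return sum(load) / (len(load) * max(load))


load = route_tokens(n_tokens=256, n_experts=16, top_k=2)
print("tokens per expert:", load)
print(f"lock-step efficiency: {lockstep_efficiency(load):.2f}")
```

When experts run independently, as on a MIMD mesh, total time is bounded by the busiest expert but the lightly loaded cores are free for other work instead of executing padding. Real routers are usually more skewed than uniform random sampling, so this sketch likely understates the imbalance.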

2. Long-Context Inference

Bottlenecked by KV-cache memory traffic.
Tenstorrent's distributed SRAM acts as a fast scratchpad close to compute, reducing DRAM traffic.
On-chip mesh allows KV-cache to be spread across many chips' SRAM rather than centralized in HBM.

3. Agentic Workloads

Branching prompts, speculative decoding, varying batch composition.
Benefit from the predictable latency of cache-free, software-scheduled memory access.
SGLang and similar engines target exactly this pattern.

4. Cost Per Token at Scale

GDDR6 (not HBM) + no switch fabric (Ethernet) + open-source software (no licensing) drives BOM costs lower per FLOP than equivalent NVIDIA or AMD systems.
At rack and supercluster scale, this compounds.
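
How hardware cost feeds into cost per token can be sketched with a simple amortization model. Every input below is a placeholder chosen to make the arithmetic easy to follow; none of the numbers are vendor prices or measured throughput, and the model ignores cooling, networking, staffing, and financing.

```python
def cost_per_million_tokens(capex_usd, lifetime_years, power_kw, usd_per_kwh,
                            tokens_per_s, utilization=1.0):
    """Amortized hardware plus energy cost per one million generated tokens."""
    if tokens_per_s <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    hours = lifetime_years * 365 * 24
    energy_cost = power_kw * hours * usd_per_kwh
    total_tokens = tokens_per_s * utilization * hours * 3600
    return (capex_usd + energy_cost) / total_tokens * 1e6


# Two hypothetical systems with equal throughput but different purchase price.
system_a = cost_per_million_tokens(capex_usd=300_000, lifetime_years=4, power_kw=10,
                                   usd_per_kwh=0.10, tokens_per_s=20_000, utilization=0.6)
system_b = cost_per_million_tokens(capex_usd=150_000, lifetime_years=4, power_kw=10,
                                   usd_per_kwh=0.10, tokens_per_s=20_000, utilization=0.6)
print(f"system A: ${system_a:.3f} per 1M tokens")
print(f"system B: ${system_b:.3f} per 1M tokens")
```

The model makes the structure of the argument visible: a lower bill of materials only translates into lower cost per token if throughput and utilization hold up, which is exactly what the software maturity gap puts at risk.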

5. Sovereign Deployment

A procurement criterion rather than a technical one, but it favors open hardware + open ISA + open software.
Tenstorrent is the only player at meaningful scale that hits all three.

The Honest Counterpoint

None of this means Tenstorrent wins automatically:

The software stack maturity gap is real.
The per-chip memory disadvantage is real.
The benchmark coverage compared to NVIDIA is sparse.

What it means is the architectural choices are well-fitted to where inference workloads are heading - and as workloads diverge from the dense-matrix-multiply pattern GPUs were designed for, Tenstorrent's bet becomes increasingly defensible.

Coming Next

In Part 6 - the final piece in this series - we look at proof-of-concept deployments, named companies actively using Tenstorrent, and what the published cost and performance numbers say about the real-world impact for buyers.

Sources: NVIDIA H100, H200, B200 product briefs; AMD Instinct MI300X, MI325X, MI355X technical disclosures; Tenstorrent docs.tenstorrent.com, tt-metal and tt-forge GitHub repositories, TT-Fabric architecture documentation; SemiAnalysis Blackwell TCO analysis; Spheron and GMI Cloud inference benchmarks; "HetGPU" (arXiv 2506.15993); Moor Insights & Strategy Tenstorrent analyst note; ASPLOS 2025 Blackhole microbenchmarking paper.
