Inside the Tenstorrent Chips: Grayskull, Wormhole, Blackhole, and Galaxy

A company can have the most compelling thesis in the world, but it lives or dies on the silicon. This piece walks through the actual Tenstorrent chips - what they are, what they cost, and what the published, mostly open-source benchmarks say about how they stack up against NVIDIA and AMD. We cover the full product ladder from Grayskull to Galaxy Blackhole, explain what makes a Tensix core architecturally different, and walk through the published benchmarks honestly - including where the numbers come from and what caveats apply. This is Part 4 of the AI Inference Hardware series.

Yashwanth

June 12, 2026

Introduction

In Part 3 we covered Tenstorrent's strategy and team. This piece walks through:

The actual chips - what they are and what they cost
What makes a Tensix core architecturally different
What the published benchmarks say honestly, including caveats

The Tenstorrent Product Ladder

Tenstorrent has three chip generations in market, each with multiple PCIe board variants, plus rack-scale systems built on top.

1. Grayskull (Gen 1)

First-generation chip, announced 2021.
Developer cards: e75 (USD 599) and e150 (USD 799).
120 Tensix cores, mostly used as evaluation hardware.
Largely superseded by Wormhole now.
Historically significant as the first commercial chip to ship the Tensix architecture.

2. Wormhole (Gen 2, mid-2024)

The chip with the most mature TT-Metal documentation, tutorials, and verified model support as of Q1 2026.

Wormhole n150 (single-chip PCIe card):

72 Tensix cores
108 MB distributed SRAM
12 GB GDDR6 at 288 GB/s
160 W TDP
262 FP8 TFLOPS at 1 GHz

Wormhole n300 (dual-chip board):

128 Tensix cores
192 MB SRAM
24 GB GDDR6 at 576 GB/s aggregate
300 W TDP
466 FP8 TFLOPS

Each Tensix core contains five baby RISC-V cores plus the matrix and vector engines.
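The two Wormhole spec lists above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the dictionary layout and the function name are made up for this example, and the inputs are simply the board figures quoted above, not measurements.

```python
# Board figures as quoted in this article (vendor specs, not measured values).
WORMHOLE_BOARDS = {
    "n150": {"cores": 72, "sram_mb": 108, "tdp_w": 160, "fp8_tflops": 262},
    "n300": {"cores": 128, "sram_mb": 192, "tdp_w": 300, "fp8_tflops": 466},
}


def derived_metrics(board):
    """Return SRAM per core (MB), peak FP8 TFLOPS per core, and TFLOPS per watt."""
    return {
        "sram_mb_per_core": board["sram_mb"] / board["cores"],
        "tflops_per_core": board["fp8_tflops"] / board["cores"],
        "tflops_per_watt": board["fp8_tflops"] / board["tdp_w"],
    }


for name, board in WORMHOLE_BOARDS.items():
    m = derived_metrics(board)
    print(
        f"{name}: {m['sram_mb_per_core']:.2f} MB SRAM/core, "
        f"{m['tflops_per_core']:.2f} TFLOPS/core, "
        f"{m['tflops_per_watt']:.2f} TFLOPS/W"
    )
```

Both boards work out to 1.5 MB of SRAM per core and roughly 3.64 peak FP8 TFLOPS per core, so the n300 looks like two slightly cut-down chips running at the same per-core rate rather than a different design.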

3. Blackhole (Gen 3, 2025)

Tenstorrent's strongest current chip and the basis for the Galaxy Blackhole rack systems. Launched at Tenstorrent Dev Day. Manufactured on Samsung 6nm.

PCIe board variants:

p100a - USD 999, no Ethernet, active-cooled for desktop, 28 GB GDDR6, 120 Tensix cores.
p150a - USD 1,399, with Ethernet, active-cooled, 32 GB GDDR6.
p150b - USD 1,399, passive-cooled for rack servers.

The Ethernet-equipped p150 boards each have four passive QSFP-DD 800G ports for linking cards together; the p100a has none. Each chip has 16 "big" RISC-V cores plus the Tensix mesh.

Published peak compute: 745 TFLOPS FP8 per Tenstorrent's specifications (not yet independently confirmed at scale).

Total Ethernet bandwidth: 1 TB/s across ten 400 Gbps links.
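Link speeds are quoted in gigabits per second while the total is quoted in terabytes per second, so the conversion is worth spelling out. The helper below is a minimal sketch with a made-up name. Whether the 1 TB/s total counts both directions is an assumption here, since the spec line does not say.

```python
def aggregate_link_bandwidth_gbytes(num_links, gbps_per_link, bidirectional=False):
    """Convert a number of links rated in Gbps into aggregate GB/s."""
    gbits_per_sec = num_links * gbps_per_link
    gbytes_per_sec = gbits_per_sec / 8  # 8 bits per byte
    # Counting both directions doubles the headline figure.
    return gbytes_per_sec * 2 if bidirectional else gbytes_per_sec


one_way = aggregate_link_bandwidth_gbytes(10, 400)
both_ways = aggregate_link_bandwidth_gbytes(10, 400, bidirectional=True)
print(f"one direction: {one_way:.0f} GB/s, both directions: {both_ways:.0f} GB/s")
```

Ten 400 Gbps links carry 500 GB/s in one direction, so the 1 TB/s headline only adds up if traffic in both directions is counted.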

Important note on Blackhole specs: In February 2026, Tenstorrent issued firmware v19.5.0, which reduced the Tensix core count available on the p150 cards from 140 to 120 and the SRAM from 210 MB to 180 MB. The published BLOCKFP8 figure for the p150 moved from 774 TFLOPS to 664 TFLOPS. Existing users should expect a 1-2% performance drop. The change was communicated by email and disclosed on GitHub. Tenstorrent did not publicly elaborate on the reason, though the most plausible read is silicon yield management on the 6nm process. This is the kind of after-the-fact transparency that a fully open hardware company subjects itself to, and it is worth noting here.
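The revised figures are consistent with simple linear scaling by core count, which can be checked directly. This is a minimal sketch: the constant and function names are invented for the example, and the 1.5 MB per-core SRAM value is the one this article quotes for a Tensix core.

```python
ORIGINAL_CORES = 140
REVISED_CORES = 120
SRAM_MB_PER_CORE = 1.5  # per-core SRAM figure quoted in this article
ORIGINAL_BLOCKFP8_TFLOPS = 774


def scale_by_cores(value, old_cores=ORIGINAL_CORES, new_cores=REVISED_CORES):
    """Scale a per-chip peak figure linearly with the number of cores."""
    return value * new_cores / old_cores


print(f"core reduction: {1 - REVISED_CORES / ORIGINAL_CORES:.1%}")
print(
    f"SRAM: {ORIGINAL_CORES * SRAM_MB_PER_CORE:.0f} MB -> "
    f"{REVISED_CORES * SRAM_MB_PER_CORE:.0f} MB"
)
print(
    f"BLOCKFP8 peak: {ORIGINAL_BLOCKFP8_TFLOPS} -> "
    f"{scale_by_cores(ORIGINAL_BLOCKFP8_TFLOPS):.1f} TFLOPS"
)
```

Scaling 774 TFLOPS by 120/140 gives about 663.4, which matches the published 664 within rounding. Note the gap between the roughly 14% cut in peak resources and the 1-2% drop users are told to expect: peak figures and delivered performance are different things.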

4. TT-QuietBox

Liquid-cooled desktop workstation built around 4 Blackhole processors.
Priced at USD 11,999 for the Blackhole configuration. The earlier Wormhole configuration (4 Wormhole n300s plus an AMD EPYC 8124P 16-core CPU, also liquid-cooled) listed at USD 15,000.

5. Galaxy Wormhole

6U rack system with 32 Wormhole processors on a single board.
Connected through the on-chip Ethernet mesh without any PCIe switch in the data path.
Building block for multi-machine Tenstorrent deployments using TT-Fabric.

6. Galaxy Blackhole

The newest rack system, announced at TT-Deploy in 2026. Air-cooled.

Base configuration:

32 Blackhole chips
23 PFLOPS of FP8 compute
6.2 GB on-chip SRAM at 2.9 PB/s aggregate
1 TB DRAM at 16 TB/s aggregate
56 x 800G Ethernet ports for up to 11.2 TB/s scale-out bandwidth
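Dividing the rack-level totals by 32 gives the per-chip figures they imply, which is a useful sanity check against the card specs. The sketch below assumes the totals are spread evenly across chips and uses decimal units (1 TB = 1000 GB). The names are made up for this example.

```python
CHIPS = 32

# Rack-level totals as quoted above for the base Galaxy Blackhole configuration.
GALAXY = {
    "fp8_pflops": 23,
    "sram_gb": 6.2,
    "dram_tb": 1,
    "dram_tb_per_s": 16,
}


def per_chip(galaxy=GALAXY, chips=CHIPS):
    """Divide rack totals evenly across chips, stepping each unit down by 1000."""
    return {
        "fp8_tflops": galaxy["fp8_pflops"] * 1000 / chips,
        "sram_mb": galaxy["sram_gb"] * 1000 / chips,
        "dram_gb": galaxy["dram_tb"] * 1000 / chips,
        "dram_gb_per_s": galaxy["dram_tb_per_s"] * 1000 / chips,
    }


for key, value in per_chip().items():
    print(f"{key}: {value:.2f}")
```

The implied per-chip DRAM comes out at about 31 GB, in line with a 32 GB card, and the implied per-chip DRAM bandwidth is 500 GB/s.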

Pricing:

Starts at USD 110,000 per Galaxy unit.
Supercluster configurations of 4 to 36 Galaxies available.
4-Galaxy supercluster base: USD 440,000.
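The pricing above reduces to a couple of unit costs. The sketch below assumes supercluster pricing is strictly linear in the number of Galaxies, which the source only confirms for the 4-Galaxy base, and it ignores networking, hosting, and any volume discounts. The function names are hypothetical.

```python
GALAXY_UNIT_PRICE_USD = 110_000
CHIPS_PER_GALAXY = 32
FP8_PFLOPS_PER_GALAXY = 23


def supercluster_base_price(num_galaxies):
    """Base price in USD, assuming linear pricing across the 4 to 36 range."""
    if not 4 <= num_galaxies <= 36:
        raise ValueError("superclusters are listed for 4 to 36 Galaxies")
    return num_galaxies * GALAXY_UNIT_PRICE_USD


def usd_per_chip():
    return GALAXY_UNIT_PRICE_USD / CHIPS_PER_GALAXY


def usd_per_fp8_pflops():
    return GALAXY_UNIT_PRICE_USD / FP8_PFLOPS_PER_GALAXY


print(f"4 Galaxies:  USD {supercluster_base_price(4):,}")
print(f"36 Galaxies: USD {supercluster_base_price(36):,}")
print(f"per chip:    USD {usd_per_chip():,.2f}")
print(f"per PFLOPS:  USD {usd_per_fp8_pflops():,.2f}")
```

At list price that is about USD 3,440 per chip inside a Galaxy, compared with USD 1,399 for a standalone p150 card, so the rack price covers a good deal more than the chips themselves.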

What Makes a Tensix Core Different

The Tensix core is the unit of computation in Tenstorrent silicon, and understanding it is the key to understanding why benchmarks against GPUs do not behave the way you might expect.

Internal Structure

A single Tensix core contains:

5 small RISC-V cores ("baby RISC-V")
Matrix engine (FPU)
Vector engine (SFPU)
Data pack/unpack units
1.5 MB of dedicated local SRAM

The five baby cores each handle a distinct stage of the compute pipeline:

Ingestion of operands from the network-on-chip
Unpacking
Matrix or vector compute
Packing
Output back to the NoC

They are not general-purpose multi-threaded cores in the CPU sense - they are specialized control and dataflow processors that coordinate the matrix and vector units.
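To make the division of labour concrete, here is a toy model of five single-threaded workers passing tiles through small bounded buffers. It is an analogy written in Python, not Tenstorrent's programming model or API: the stage functions, the buffer depth of 2, and the squaring "compute" step are all placeholders invented for this sketch.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(work, inbox, outbox):
    """One single-threaded stage: pull an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(item))


def run_pipeline(tiles):
    # Hypothetical stand-ins for the five stages listed above.
    works = [
        lambda t: list(t),                   # 1. ingest operands from the NoC
        lambda t: [float(x) for x in t],     # 2. unpack into compute format
        lambda t: [x * x for x in t],        # 3. compute (here: square each value)
        lambda t: [round(x, 3) for x in t],  # 4. pack results
        lambda t: t,                         # 5. write back to the NoC
    ]
    # Bounded queues model small buffers between stages; the final queue is
    # unbounded so results can accumulate until the caller collects them.
    queues = [queue.Queue(maxsize=2) for _ in works] + [queue.Queue()]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(works)
    ]
    for thread in threads:
        thread.start()
    for tile in tiles:
        queues[0].put(tile)  # blocks when the first buffer is full
    queues[0].put(STOP)
    for thread in threads:
        thread.join()

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)


print(run_pipeline([[1, 2], [3, 4]]))
```

The point of the model is that no stage waits for the whole job: while one tile is being computed, the next is already being unpacked and a third is being fetched. Throughput comes from overlapping the stages rather than from hiding latency behind many threads.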

Key Architectural Choices

No SIMT execution model. No warp of 32 threads executing in lock-step.
No cache hierarchy. The 1.5 MB of SRAM per core is software-managed. Data movement is explicit via DMA operations. Deterministic memory access latency - hugely valuable for compilers that want to schedule data movement precisely.
No hardware multithreading hiding latency. Each baby RISC-V core is single-threaded. The architecture works through software pipelining and asynchronous I/O rather than warp scheduling.
Native 32x32 tile granularity - deliberate choice optimized for matrix multiplication and convolution.
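The 32x32 tile and the 1.5 MB software-managed SRAM together set a simple budget that a kernel author has to plan around. The sketch below works it out under two stated assumptions that are not from the source: elements are 16 bits wide (for example BF16), and "1.5 MB" is read as 1.5 MiB. The helper names are made up.

```python
import math

TILE_DIM = 32
BYTES_PER_ELEMENT = 2                # assumption: 16-bit elements such as BF16
SRAM_BYTES = int(1.5 * 1024 * 1024)  # assumption: "1.5 MB" read as 1.5 MiB


def padded_shape(rows, cols, tile_dim=TILE_DIM):
    """Round each dimension up to a whole number of tiles."""
    return (
        math.ceil(rows / tile_dim) * tile_dim,
        math.ceil(cols / tile_dim) * tile_dim,
    )


def tiles_for_matrix(rows, cols, tile_dim=TILE_DIM):
    """Number of tiles needed to cover a rows x cols matrix after padding."""
    return math.ceil(rows / tile_dim) * math.ceil(cols / tile_dim)


tile_bytes = TILE_DIM * TILE_DIM * BYTES_PER_ELEMENT
tiles_in_sram = SRAM_BYTES // tile_bytes

print(f"bytes per tile: {tile_bytes}")
print(f"tiles that fit in one core's SRAM (upper bound): {tiles_in_sram}")
print(f"100 x 70 matrix pads to {padded_shape(100, 70)} = {tiles_for_matrix(100, 70)} tiles")
```

Under those assumptions a tile is 2 KiB and at most 768 tiles fit in one core's SRAM. The real number is lower because the same SRAM also has to hold program code and buffers, and deciding what lives there at any moment is the software's job rather than a cache's.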

Chip-Level Organization

Inside the chip, Tensix cores are arranged in a 2D mesh. Blackhole has:

A grid of 140 Tensix cores (now 120 after the firmware change on p150)
24 GDDR6 memory controllers
I/O and management units: Ethernet, PCIe, Arc microcontroller, larger RISC-V control CPUs

The network-on-chip (NoC) handles data movement between Tensix cores and between cores and memory controllers, with high bandwidth and predictable routing.
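Predictable routing means the path between two cores can be worked out ahead of time. The sketch below uses generic dimension-ordered (X first, then Y) routing on a 2D grid to show the idea. This is an assumption for illustration: the sources for this article do not specify the routing algorithm Tenstorrent's NoC actually uses, and the coordinates are hypothetical.

```python
def xy_route(src, dst):
    """Return the (x, y) positions visited going from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of hops on the grid, known without simulating the route."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


route = xy_route((0, 0), (2, 1))
print(route, "hops:", hop_count((0, 0), (2, 1)))
```

Because the route is a pure function of source and destination, a compiler can estimate transfer cost when it places work on the grid instead of discovering it at run time.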

Why This Matters for Inference

Explicit data movement model maps cleanly onto the data flow of transformer inference (load weights into SRAM, stream activations through, write KV-cache back).
Distributed SRAM acts as a fast scratchpad that reduces DRAM traffic.
On-chip Ethernet means scaling out to dozens of chips does not require a switch hierarchy that becomes its own bottleneck.

Open-Source Benchmarks: Where the Data Lands

The honest summary: independent third-party benchmarks of Tenstorrent hardware are still sparse. Most published numbers come from one of three sources:

Tenstorrent's own marketing and engineering blog posts
Academic papers with microbenchmarks
Independent technology blogs (Spheron, Moor Insights & Strategy, EE Times)

Within those constraints, headline numbers worth taking seriously:

1. Wormhole Galaxy on Llama 70B

Per Tenstorrent's own benchmarking: 4,000-5,000 tokens/sec on Llama 70B at batch 32 on TT-Metal.
For comparison: 8x H100 SXM5 node running vLLM achieves 2,500-3,500 tok/s at the same batch size.
Caveat: H100 numbers reflect a full production serving stack (vLLM, PagedAttention, continuous batching, queue management). Galaxy numbers are controlled single-model benchmark runs without an equivalent production serving layer.
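Both sides of this comparison are ranges, so the implied speedup is a range too. A minimal sketch, with a made-up helper name:

```python
def ratio_range(a_low, a_high, b_low, b_high):
    """Smallest and largest possible a/b given a range for each side."""
    return a_low / b_high, a_high / b_low


low, high = ratio_range(4000, 5000, 2500, 3500)
print(f"Galaxy vs 8x H100: {low:.2f}x to {high:.2f}x")
```

The quoted figures support anything from about 1.14x to 2x, and that is before applying the caveat about the serving stack, so a single headline multiple would overstate how much is known.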

2. Galaxy Blackhole on DeepSeek R1 (the headline number)

Demonstrated at TT-Deploy in 2026:

350+ tokens per second per user on DeepSeek R1 671B
Across 16 Galaxy Blackhole units (512 Blackhole chips total)
Batch size 32, ~4-second time-to-first-token on 100K context
Running prefill and decode on the same hardware
Cost claim: USD 6 per million tokens, vs implied USD 30/M on NVIDIA - a 5x TCO advantage

The architectural fit story matters here: DeepSeek R1 is an MoE model, and the mesh MIMD architecture is well-suited to expert routing without the warp divergence cost a SIMT GPU pays.
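It helps to turn the per-user headline into system-level numbers. The sketch below assumes that batch size 32 means 32 users being served at once, which the demo description does not state outright, and it treats "350+" as a floor. The names are invented for this example.

```python
TOKENS_PER_SEC_PER_USER = 350  # quoted as "350+", so treat as a floor
BATCH_SIZE = 32                # assumption: one concurrent user per batch slot
GALAXIES = 16
CHIPS_PER_GALAXY = 32


def total_chips():
    return GALAXIES * CHIPS_PER_GALAXY


def aggregate_tokens_per_sec():
    return TOKENS_PER_SEC_PER_USER * BATCH_SIZE


def cost_ratio(baseline_usd_per_million, claimed_usd_per_million):
    return baseline_usd_per_million / claimed_usd_per_million


chips = total_chips()
aggregate = aggregate_tokens_per_sec()
print(f"chips: {chips}")
print(f"aggregate throughput: {aggregate:,} tok/s")
print(f"per chip: {aggregate / chips:.1f} tok/s")
print(f"claimed cost advantage: {cost_ratio(30, 6):.1f}x")
```

On those assumptions the demo delivers at least 11,200 tokens per second across 512 chips. That works out to roughly 22 tokens per second per chip, which shows the configuration is tuned for low per-user latency on a very large model rather than for maximum throughput per chip.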

3. TTS Workload Comparison (Academic Paper)

Published academic benchmark: "Lightning V2 on Tenstorrent"
Claims 4x lower cost per inference vs NVIDIA L40S
Attributable to distributed dataflow architecture, distributed on-chip SRAM, and 1:1 thread-to-core mapping
One of the cleaner head-to-head academic comparisons available

4. Per-Chip Raw Compute

Blackhole at 745 TFLOPS FP8 per Tenstorrent's spec.
NVIDIA A100 at 624 TOPS INT8 (1,248 with sparsity). The A100 has no FP8 mode, so INT8 is the closest 8-bit comparison.
Suggests Blackhole was originally positioned as an A100 competitor on raw compute, not an H100/B200 competitor.
Wormhole n300's 466 FP8 TFLOPS at 300 W is similarly in the A100 class.
Tenstorrent argues its lead is in cost per FLOP, energy per FLOP, and scaling efficiency once you put dozens or hundreds of chips together.

5. ASPLOS 2025 Microbenchmarks

Published paper: "Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking"
Single Tensix core sustained nearly its theoretical 32-element-per-cycle throughput for single-precision addition - SFPU vector unit fully utilized.
Mandelbrot rendering parallel test: 22.4x speedup over a single-core CPU baseline when run across all Tensix cores.
Headline finding: the cache-free architecture does not hinder single-core performance for workloads with regular memory access patterns - but shifts the programming burden onto the software stack.

What the Chips Are Not Good At Yet

Three honest weaknesses:

1. Per-Chip Memory Capacity Is Lower

Blackhole p150: 32 GB GDDR6
H100 SXM: 80 GB HBM3
H200: 141 GB HBM3e
MI300X: 192 GB HBM3

Tenstorrent's answer: pool memory across the on-chip Ethernet mesh. Four Blackhole p150 cards linked via QSFP-DD share 128 GB total at high bandwidth. This only works if your workload tolerates the explicit multi-chip programming model.
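A quick way to see what pooling means in practice is to count the cards needed just to hold a model's weights. The sketch below is a lower bound under simplifying assumptions that are mine, not the source's: decimal gigabytes, weights only (no KV cache, activations, or runtime overhead), and perfectly even sharding. The function name is hypothetical.

```python
import math


def cards_needed(params_billion, bytes_per_param, card_gb=32):
    """Minimum number of cards whose pooled DRAM holds the weights alone."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes / 1e9 bytes per GB
    return math.ceil(weight_gb / card_gb)


for name, params, width in [
    ("70B at 8-bit", 70, 1),
    ("70B at 16-bit", 70, 2),
    ("671B at 8-bit", 671, 1),
]:
    print(f"{name}: at least {cards_needed(params, width)} x 32 GB cards")
```

A 70B model at 8 bits needs at least 3 cards for weights alone, and a 671B model needs at least 21, so multi-chip operation is the normal case on this hardware rather than a scaling option.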

2. GDDR6 vs HBM

Wormhole and Blackhole use GDDR6 - significantly cheaper than HBM3/HBM3e.
But lower bandwidth per chip:
- Wormhole: 288-576 GB/s
- Blackhole: ~0.5 TB/s (the 16 TB/s Galaxy Blackhole aggregate spread across its 32 chips)
- H100: 3.35 TB/s
- MI300X: 5.3 TB/s
The bet: distributed on-chip SRAM (210 MB original / 180 MB after the firmware revision) plus mesh scaling compensates for the lower DRAM bandwidth per chip.
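One way to read the mesh-scaling half of that bet is to ask how many GDDR6 boards it takes for their summed DRAM bandwidth to reach that of a single HBM device. The sketch below uses the Wormhole, H100, and MI300X figures from the list above and assumes bandwidth adds up perfectly across boards, which is optimistic. The names are made up for this example.

```python
import math

# Per-device DRAM bandwidth in GB/s, as quoted in the list above.
BANDWIDTH_GB_S = {
    "Wormhole n150": 288,
    "Wormhole n300": 576,
    "H100": 3350,
    "MI300X": 5300,
}


def boards_to_match(target, board, table=BANDWIDTH_GB_S):
    """Boards needed for summed DRAM bandwidth to reach the target device's."""
    return math.ceil(table[target] / table[board])


for target in ("H100", "MI300X"):
    for board in ("Wormhole n150", "Wormhole n300"):
        print(f"{boards_to_match(target, board)} x {board} to match one {target}")
```

Six n300 boards match one H100 on raw DRAM bandwidth under this idealized model. Whether that trade is attractive then depends on price, power, and how well the workload shards across the mesh.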

3. Software Ecosystem Maturity

No Tenstorrent equivalent to vLLM with PagedAttention and continuous batching today.
Production deployments depend on TT-Metal kernels or TT-NN's PyTorch-like layer with the verified model library.
TT-Forge MLIR compiler is improving rapidly - GitHub claims 800+ model variants tested in CI, with GPT-OSS 120B, Llama 3 70B, Stable Diffusion XL, Whisper, and YOLOv12 all running today from PyTorch, JAX, or ONNX.
But production-grade serving infrastructure (rate limiting, observability, autoscaling, multi-tenancy) is still at an earlier stage than NVIDIA's toolchain.

The Takeaway for Buyers

If you want lowest cost-per-token today on standard workloads

Tenstorrent is not the right answer yet.
NVIDIA H100s or B200s on a neo-cloud will be cheaper and easier.

If your workload is MoE-heavy or long-context

DeepSeek-class, frontier reasoning models, large-batch inference where data movement dominates.
You have hardware procurement authority and can invest in custom kernel work or take advantage of Tenstorrent's verified model library.
The Galaxy Blackhole numbers are compelling on a TCO basis.

If you are a sovereign / regulated buyer

Government compute programs, regulated industries, hyperscalers building custom silicon.
Auditable open hardware is a procurement requirement, not a preference.
Tenstorrent is uniquely positioned and there is no equivalent open competitor at this maturity level.

Coming Next

In Part 5, we get into the architectural deep dive: side-by-side comparison of NVIDIA, AMD, and Tenstorrent at the technical and strategic level, with a particular focus on why Tenstorrent's architectural choices are well-suited to inference.

Sources: Tenstorrent official documentation (docs.tenstorrent.com), hardware product pages (tenstorrent.com/hardware/wormhole, tenstorrent.com/hardware/blackhole), Tenstorrent newsroom posts (TT-Deploy, Blackhole Developer Products launch), Tom's Hardware coverage of the p150 firmware revision (Feb 2026), VideoCardz coverage, GitHub repositories (tt-metal, tt-forge), "Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking" (ASPLOS 2025), "Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities" (arXiv 2505.06085), "Attention in SRAM on Tenstorrent Grayskull" (arXiv 2407.13885), Spheron Tenstorrent vs NVIDIA comparison (April 2026), wccftech coverage of the Galaxy Blackhole launch.
