Inside the Tenstorrent Chips: Grayskull, Wormhole, Blackhole, and Galaxy
A company can have the most compelling thesis in the world, but it lives or dies on the silicon. This piece covers the full Tenstorrent product ladder from Grayskull to Galaxy Blackhole (what the chips are and what they cost), explains what makes a Tensix core architecturally different, and reviews the published, mostly open-source benchmarks against NVIDIA and AMD, including where the numbers come from and what caveats apply. This is Part 4 of the AI Inference Hardware series.
Introduction
In Part 3 we covered Tenstorrent's strategy and team. This piece walks through:
- The actual chips - what they are and what they cost
- What makes a Tensix core architecturally different
- What the published benchmarks say honestly, including caveats
The Tenstorrent Product Ladder
Tenstorrent has three chip generations in market, each with multiple PCIe board variants, plus rack-scale systems built on top.
1. Grayskull (Gen 1)
- First-generation chip, announced 2021.
- Developer cards: e75 (USD 599) and e150 (USD 799).
- 120 Tensix cores, mostly used as evaluation hardware.
- Largely superseded by Wormhole now.
- Historically significant as the first commercial chip to ship the Tensix architecture.
2. Wormhole (Gen 2, mid-2024)
Wormhole is the chip with the most mature TT-Metal documentation, tutorials, and verified model support as of Q1 2026.
Wormhole n150 (single-chip PCIe card):
- 72 Tensix cores
- 108 MB distributed SRAM
- 12 GB GDDR6 at 288 GB/s
- 160 W TDP
- 262 FP8 TFLOPS at 1 GHz
Wormhole n300 (dual-chip board):
- 128 Tensix cores
- 192 MB SRAM
- 24 GB GDDR6 at 576 GB/s aggregate
- 300 W TDP
- 466 FP8 TFLOPS
Each Tensix core contains five baby RISC-V cores plus the matrix and vector engines.
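As a quick sanity check on the board figures above, the per-core SRAM and the peak compute per watt can be derived directly from the listed specs. This is a minimal sketch: the dictionary layout and the function name are illustrative and not part of any Tenstorrent tool, and the ratios use peak (not measured) FP8 numbers.

```python
# Board figures copied from the spec lists above; the structure is illustrative.
WORMHOLE_BOARDS = {
    "n150": {"tensix_cores": 72, "sram_mb": 108, "tdp_w": 160, "fp8_tflops": 262},
    "n300": {"tensix_cores": 128, "sram_mb": 192, "tdp_w": 300, "fp8_tflops": 466},
}

def derived_ratios(board):
    """Return SRAM per Tensix core (MB) and peak FP8 TFLOPS per watt of TDP."""
    return {
        "sram_mb_per_core": board["sram_mb"] / board["tensix_cores"],
        "fp8_tflops_per_watt": board["fp8_tflops"] / board["tdp_w"],
    }

for name, board in WORMHOLE_BOARDS.items():
    ratios = derived_ratios(board)
    print(f"{name}: {ratios['sram_mb_per_core']:.2f} MB SRAM per core, "
          f"{ratios['fp8_tflops_per_watt']:.2f} peak FP8 TFLOPS per watt")
```

Both boards work out to 1.5 MB of SRAM per Tensix core, which is consistent with the per-core figure given later in this piece.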
3. Blackhole (Gen 3, 2025)
Tenstorrent's strongest current chip and the basis for the Galaxy Blackhole rack systems. Launched at Tenstorrent Dev Day. Manufactured on Samsung 6nm.
PCIe board variants:
- p100a - USD 999, no Ethernet, active-cooled for desktop, 28 GB GDDR6, 120 Tensix cores.
- p150a - USD 1,399, with Ethernet, active-cooled, 32 GB GDDR6.
- p150b - USD 1,399, passive-cooled for rack servers.
The Ethernet-equipped p150 boards have four passive QSFP-DD 800G ports for linking cards together; the p100a has none. Each chip has 16 "big" RISC-V cores plus the Tensix mesh.
Published peak compute: 745 TFLOPS FP8 per Tenstorrent's specifications (not yet independently confirmed at scale).
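Total Ethernet bandwidth: 1 TB/s across ten 400 Gbps links. Ten links at 400 Gbps carry 500 GB/s in each direction, so the 1 TB/s figure appears to count both directions.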
Important note on Blackhole specs: In February 2026, Tenstorrent issued firmware v19.5.0, which reduced the published Tensix core count on the p150 cards from 140 to 120 and the SRAM from 210 MB to 180 MB. The published BLOCKFP8 figure for the p150 moved from 774 TFLOPS to 664 TFLOPS. Existing users should expect a 1-2% performance drop. The change was communicated by email and disclosed on GitHub. Tenstorrent did not publicly elaborate on the reason, though the most plausible read is silicon yield management on the 6nm process. This is the kind of after-the-fact transparency that a fully open hardware company subjects itself to, and it is worth noting.
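The before and after figures in this note can be checked against each other with simple arithmetic. The sketch below uses only the numbers quoted above; the variable and function names are illustrative.

```python
# Before/after figures quoted in the note above for the p150 cards.
BEFORE = {"tensix_cores": 140, "sram_mb": 210, "blockfp8_tflops": 774}
AFTER = {"tensix_cores": 120, "sram_mb": 180, "blockfp8_tflops": 664}

def percent_drop(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before

drops = {key: percent_drop(BEFORE[key], AFTER[key]) for key in BEFORE}
for key, value in drops.items():
    print(f"{key}: {value:.1f}% lower")

# SRAM per core is unchanged if SRAM is removed together with the cores.
sram_per_core_before = BEFORE["sram_mb"] / BEFORE["tensix_cores"]
sram_per_core_after = AFTER["sram_mb"] / AFTER["tensix_cores"]
print(sram_per_core_before, sram_per_core_after)
```

All three published figures fall by roughly 14%, which is what you would expect if the spec sheet scales linearly with the number of enabled cores. The source does not explain how that squares with the quoted 1-2% real-world drop.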
4. TT-QuietBox
- Liquid-cooled desktop workstation built around 4 Blackhole processors.
- Priced at USD 11,999 for the Blackhole configuration. The USD 15,000 figure corresponds to the Wormhole-based QuietBox (4 Wormhole n300s + AMD EPYC 8124P 16-core CPU under liquid cooling).
5. Galaxy Wormhole
- 6U rack system with 32 Wormhole processors on a single board.
- Connected through the on-chip Ethernet mesh without any PCIe switch in the data path.
- Building block for multi-machine Tenstorrent deployments using TT-Fabric.
6. Galaxy Blackhole
The newest rack system, announced at TT-Deploy in 2026. Air-cooled.
Base configuration:
- 32 Blackhole chips
- 23 PFLOPS of FP8 compute
- 6.2 GB on-chip SRAM at 2.9 PB/s aggregate
- 1 TB DRAM at 16 TB/s aggregate
- 56 x 800G Ethernet ports for up to 11.2 TB/s scale-out bandwidth
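To relate the rack-level figures back to a single chip, the aggregates can be divided by the 32 chips in the system. This is a rough sketch using only the numbers above; it assumes decimal units for bandwidth and 1 TB = 1024 GB for capacity, and the reading of the Ethernet figure as bidirectional is an inference rather than something the source states.

```python
CHIPS = 32
FP8_PFLOPS = 23
DRAM_TB = 1
DRAM_TB_PER_S = 16
ETH_PORTS = 56
ETH_PORT_GBPS = 800  # gigabits per second per port

per_chip_fp8_tflops = FP8_PFLOPS * 1000 / CHIPS
per_chip_dram_gb = DRAM_TB * 1024 / CHIPS          # assumes 1 TB = 1024 GB
per_chip_dram_gb_per_s = DRAM_TB_PER_S * 1000 / CHIPS

# Convert gigabits to gigabytes (divide by 8), then to terabytes.
eth_one_way_tb_per_s = ETH_PORTS * ETH_PORT_GBPS / 8 / 1000
eth_both_ways_tb_per_s = 2 * eth_one_way_tb_per_s

print(f"FP8 per chip: {per_chip_fp8_tflops:.1f} TFLOPS")
print(f"DRAM per chip: {per_chip_dram_gb:.0f} GB at {per_chip_dram_gb_per_s:.0f} GB/s")
print(f"Ethernet: {eth_one_way_tb_per_s:.1f} TB/s one way, "
      f"{eth_both_ways_tb_per_s:.1f} TB/s counting both directions")
```

The per-chip DRAM capacity lines up with the 32 GB on a p150 board, and the 11.2 TB/s scale-out figure matches 56 ports at 800G only when both directions are counted.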
Pricing:
- Starts at USD 110,000 per Galaxy unit.
- Supercluster configurations of 4 to 36 Galaxies available.
- 4-Galaxy supercluster base: USD 440,000.
What Makes a Tensix Core Different
The Tensix core is the unit of computation in Tenstorrent silicon, and understanding it is the key to understanding why benchmarks against GPUs do not behave the way you might expect.
Internal Structure
A single Tensix core contains:
- 5 small RISC-V cores ("baby RISC-V")
- Matrix engine (FPU)
- Vector engine (SFPU)
- Data pack/unpack units
- 1.5 MB of dedicated local SRAM
The five baby cores each handle a distinct stage of the compute pipeline:
- Ingestion of operands from the network-on-chip
- Unpacking
- Matrix or vector compute
- Packing
- Output back to the NoC
They are not general-purpose multi-threaded cores in the CPU sense - they are specialized control and dataflow processors that coordinate the matrix and vector units.
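The five-stage split can be pictured as a chain of small stages that each hand tiles to the next. The following is a toy model in plain Python, not TT-Metal code: the stage functions, the doubling "compute" step, and the tuple "packed" format are all stand-ins chosen for illustration. On real hardware the stages run concurrently on separate baby cores, whereas Python generators interleave them on a single thread.

```python
def ingest(tiles):
    """Stage 1: pull operand tiles in (stands in for a NoC read)."""
    for tile in tiles:
        yield tile

def unpack(stream):
    """Stage 2: convert the stored format to the compute format."""
    for tile in stream:
        yield [float(x) for x in tile]

def compute(stream):
    """Stage 3: the math step; here, a simple element-wise doubling."""
    for tile in stream:
        yield [2.0 * x for x in tile]

def pack(stream):
    """Stage 4: convert results back to a storage format."""
    for tile in stream:
        yield tuple(tile)

def output(stream):
    """Stage 5: write results out (stands in for a NoC write)."""
    return list(stream)

tiles = [[1, 2], [3, 4]]
result = output(pack(compute(unpack(ingest(tiles)))))
print(result)  # [(2.0, 4.0), (6.0, 8.0)]
```

Because each stage only sees the tile handed to it by the previous stage, the first tile reaches the output stage before the second tile has been ingested, which is the streaming behaviour the pipeline description implies.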
Key Architectural Choices
- No SIMT execution model. No warp of 32 threads executing in lock-step.
- No cache hierarchy. The 1.5 MB of SRAM per core is software-managed. Data movement is explicit via DMA operations. Deterministic memory access latency - hugely valuable for compilers that want to schedule data movement precisely.
- No hardware multithreading hiding latency. Each baby RISC-V core is single-threaded. The architecture works through software pipelining and asynchronous I/O rather than warp scheduling.
- Native 32x32 tile granularity - deliberate choice optimized for matrix multiplication and convolution.
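One practical consequence of a fixed 32x32 tile is that tensor shapes map most efficiently when both dimensions are multiples of 32. The sketch below counts tiles and estimates the wasted space for an arbitrary matrix shape. It assumes partial tiles are padded up to a full tile, which is an assumption for illustration and not a statement about how any particular Tenstorrent library handles ragged shapes; the function names are hypothetical.

```python
import math

TILE = 32  # tile edge length from the text above

def tile_grid(rows, cols, tile=TILE):
    """Tiles needed along each dimension, rounding up for partial tiles."""
    return math.ceil(rows / tile), math.ceil(cols / tile)

def padded_shape(rows, cols, tile=TILE):
    """Shape after padding each dimension up to a multiple of the tile size."""
    tiles_r, tiles_c = tile_grid(rows, cols, tile)
    return tiles_r * tile, tiles_c * tile

def padding_overhead(rows, cols, tile=TILE):
    """Extra elements introduced by padding, as a fraction of the original size."""
    padded_r, padded_c = padded_shape(rows, cols, tile)
    return (padded_r * padded_c) / (rows * cols) - 1.0

for shape in [(4096, 4096), (100, 100)]:
    print(shape, tile_grid(*shape), padded_shape(*shape),
          f"{padding_overhead(*shape):.1%} overhead")
```

A 4096x4096 weight matrix divides cleanly into 128x128 tiles with no waste, while a 100x100 matrix would occupy a 128x128 padded footprint under this assumption.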
Chip-Level Organization
Inside the chip, Tensix cores are arranged in a 2D mesh. Blackhole has:
- A grid of 140 Tensix cores (120 enabled on p150 cards after the February 2026 firmware revision)
- 24 GDDR6 memory controllers
- I/O and management units: Ethernet, PCIe, Arc microcontroller, larger RISC-V control CPUs
The network-on-chip (NoC) handles data movement between Tensix cores and between cores and memory controllers, with high bandwidth and predictable routing.
Why This Matters for Inference
- Explicit data movement model maps cleanly onto the data flow of transformer inference (load weights into SRAM, stream activations through, write KV-cache back).
- Distributed SRAM acts as a fast scratchpad that reduces DRAM traffic.
- On-chip Ethernet means scaling out to dozens of chips does not require a switch hierarchy that becomes its own bottleneck.
Open-Source Benchmarks: Where the Data Lands
The honest summary: independent third-party benchmarks of Tenstorrent hardware are still sparse. Most published numbers come from one of three sources:
- Tenstorrent's own marketing and engineering blog posts
- Academic papers with microbenchmarks
- Independent technology blogs (Spheron, Moor Insights & Strategy, EE Times)
Within those constraints, headline numbers worth taking seriously:
1. Wormhole Galaxy on Llama 70B
- Per Tenstorrent's own benchmarking: 4,000-5,000 tokens/sec on Llama 70B at batch 32 on TT-Metal.
- For comparison: 8x H100 SXM5 node running vLLM achieves 2,500-3,500 tok/s at the same batch size.
- Caveat: H100 numbers reflect a full production serving stack (vLLM, PagedAttention, continuous batching, queue management). Galaxy numbers are controlled single-model benchmark runs without an equivalent production serving layer.
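Since both sides are quoted as ranges, the implied speedup is also a range. A small sketch of that arithmetic, using only the figures above (the helper name is illustrative):

```python
GALAXY_TOK_S = (4000, 5000)      # Tenstorrent's own figure, batch 32
H100_NODE_TOK_S = (2500, 3500)   # 8x H100 SXM5 with vLLM, batch 32

def ratio_range(a, b):
    """Smallest and largest a/b ratio given (low, high) ranges for each side."""
    return a[0] / b[1], a[1] / b[0]

low, high = ratio_range(GALAXY_TOK_S, H100_NODE_TOK_S)
print(f"Galaxy vs 8x H100 node: {low:.2f}x to {high:.2f}x")
```

The spread runs from about 1.14x to 2x, and the caveat above about unequal serving stacks applies to the whole range.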
2. Galaxy Blackhole on DeepSeek R1 (the headline number)
Demonstrated at TT-Deploy in 2026:
- 350+ tokens per second per user on DeepSeek R1 671B
- Across 16 Galaxy Blackhole units (512 Blackhole chips total)
- Batch size 32, ~4-second time-to-first-token on 100K context
- Running prefill and decode on the same hardware
- Cost claim: USD 6 per million tokens, vs an implied USD 30 per million on NVIDIA - a claimed 5x cost-per-token advantage
The architectural fit story matters here: DeepSeek R1 is an MoE model, and the mesh MIMD architecture is well-suited to expert routing without the warp divergence cost a SIMT GPU pays.
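The demo figures can be unpacked into aggregate and per-chip terms. The sketch below assumes that "batch size 32" means 32 concurrent users each receiving 350 tokens per second, which the source does not state explicitly, and it prices the hardware at the USD 110,000 base Galaxy figure quoted earlier in this piece.

```python
TOK_S_PER_USER = 350
BATCH = 32                      # assumed to mean 32 concurrent users
GALAXIES = 16
CHIPS_PER_GALAXY = 32
GALAXY_BASE_PRICE_USD = 110_000
CLAIMED_USD_PER_M_TOKENS = 6
IMPLIED_NVIDIA_USD_PER_M_TOKENS = 30

total_chips = GALAXIES * CHIPS_PER_GALAXY
aggregate_tok_s = TOK_S_PER_USER * BATCH
tok_s_per_chip = aggregate_tok_s / total_chips
hardware_list_price_usd = GALAXIES * GALAXY_BASE_PRICE_USD
cost_ratio = IMPLIED_NVIDIA_USD_PER_M_TOKENS / CLAIMED_USD_PER_M_TOKENS

print(f"{total_chips} chips, {aggregate_tok_s} tok/s aggregate, "
      f"{tok_s_per_chip:.1f} tok/s per chip")
print(f"Hardware at base list price: USD {hardware_list_price_usd:,}")
print(f"Claimed cost advantage: {cost_ratio:.0f}x")
```

Under these assumptions the system delivers roughly 22 tokens per second per chip in aggregate. The headline is a per-user latency number, so it says more about interactive speed than about throughput per dollar.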
3. TTS Workload Comparison (Academic Paper)
- Published academic benchmark: "Lightning V2 on Tenstorrent"
- Claims 4x lower cost per inference vs NVIDIA L40S
- Attributable to distributed dataflow architecture, distributed on-chip SRAM, and 1:1 thread-to-core mapping
- One of the cleaner head-to-head academic comparisons available
4. Per-Chip Raw Compute
- Blackhole at 745 TFLOPS FP8 per Tenstorrent's spec.
- NVIDIA A100 at 624 TOPS INT8 (1,248 sparse).
- Suggests Blackhole was originally positioned as an A100 competitor on raw compute, not an H100/B200 competitor.
- Wormhole n300's 466 FP8 TFLOPS at 300 W is similarly in the A100 class.
- Tenstorrent argues its lead is in cost per FLOP, energy per FLOP, and scaling efficiency once you put dozens or hundreds of chips together.
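The cost-per-FLOP argument can be made concrete with the list price and peak figure quoted in this piece. This is a back-of-the-envelope sketch: it uses the p150a board price and the 745 TFLOPS spec figure, compares FP8 against the A100's INT8 number even though those are different formats, and leaves out A100 pricing because the source gives none.

```python
P150A_PRICE_USD = 1399
BLACKHOLE_FP8_TFLOPS = 745       # Tenstorrent spec, not independently confirmed
A100_INT8_TOPS_DENSE = 624

def usd_per_tflop(price_usd, tflops):
    """List price divided by peak throughput."""
    return price_usd / tflops

blackhole_usd_per_tflop = usd_per_tflop(P150A_PRICE_USD, BLACKHOLE_FP8_TFLOPS)
peak_ratio_vs_a100 = BLACKHOLE_FP8_TFLOPS / A100_INT8_TOPS_DENSE

print(f"Blackhole p150a: USD {blackhole_usd_per_tflop:.2f} per peak FP8 TFLOP")
print(f"Blackhole peak vs A100 dense INT8: {peak_ratio_vs_a100:.2f}x")
```

On paper that is under USD 2 per peak TFLOP at the board level, with peak compute about 1.2x the A100's dense INT8 figure. Peak numbers say nothing about achieved utilization.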
5. ASPLOS 2025 Microbenchmarks
- Published paper: "Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking"
- Single Tensix core sustained nearly its theoretical 32-element-per-cycle throughput for single-precision addition - SFPU vector unit fully utilized.
- Mandelbrot rendering parallel test: 22.4x speedup over a single CPU core when using all Tensix cores.
- Headline finding: the cache-free architecture does not hinder single-core performance for workloads with regular memory access patterns - but shifts the programming burden onto the software stack.
What the Chips Are Not Good At Yet
Three honest weaknesses:
1. Per-Chip Memory Capacity Is Lower
- Blackhole p150: 32 GB GDDR6
- H100 SXM: 80 GB HBM3
- H200: 141 GB HBM3e
- MI300X: 192 GB HBM3
Tenstorrent's answer: pool memory across the on-chip Ethernet mesh - 4 Blackhole p150 cards linked via QSFP-DD share 128 GB total at high bandwidth. This only works if your workload tolerates the explicit multi-chip programming model.
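A rough way to size a pooled deployment is to divide the weight footprint by the per-card capacity. The sketch below counts weights only and ignores KV-cache, activations, and any runtime overhead, so real deployments would need more headroom. The function name and the example model sizes are illustrative.

```python
import math

P150_GB = 32  # GDDR6 per Blackhole p150 card, from the list above

def cards_needed(params_billion, bytes_per_param, card_gb=P150_GB):
    """Minimum cards to hold the weights alone (no KV-cache or activations)."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return math.ceil(weights_gb / card_gb)

print(cards_needed(70, 1))    # 70B parameters at 1 byte each
print(cards_needed(70, 2))    # 70B parameters at 2 bytes each
print(cards_needed(671, 1))   # 671B parameters at 1 byte each
```

For comparison, the 70B case at one byte per parameter would fit on a single 80 GB H100 by the same weights-only measure, which is the capacity gap this section describes.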
2. GDDR6 vs HBM
- Wormhole and Blackhole use GDDR6 - significantly cheaper than HBM3/HBM3e.
- But lower bandwidth per chip:
  - Wormhole: 288-576 GB/s
  - Blackhole: ~0.5 TB/s (implied by the Galaxy Blackhole figure of 16 TB/s across 32 chips)
  - H100: 3.35 TB/s
  - MI300X: 5.3 TB/s
- The bet: distributed on-chip SRAM (210 MB original / 180 MB after firmware revision) plus mesh scaling compensates for the lower DRAM bandwidth.
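Why DRAM bandwidth matters for decode can be shown with a common first-order approximation: if every generated token has to re-read all weights from DRAM, the single-stream token rate cannot exceed bandwidth divided by weight size. The sketch below applies that bound to the bandwidth figures listed above. The 8 GB weight footprint is a hypothetical example, and the model ignores batching, KV-cache traffic, and any reuse from on-chip SRAM.

```python
BANDWIDTH_GB_PER_S = {
    "Wormhole n150": 288,
    "Wormhole n300": 576,
    "H100": 3350,
    "MI300X": 5300,
}

def decode_ceiling_tok_s(bandwidth_gb_per_s, weights_gb):
    """Upper bound on single-stream decode rate when weights stream from DRAM."""
    return bandwidth_gb_per_s / weights_gb

WEIGHTS_GB = 8.0  # hypothetical: 8B parameters at one byte per parameter

ceilings = {name: decode_ceiling_tok_s(bw, WEIGHTS_GB)
            for name, bw in BANDWIDTH_GB_PER_S.items()}
for name, tok_s in ceilings.items():
    print(f"{name}: at most {tok_s:.1f} tok/s per stream")
```

Any weights or activations served from SRAM instead of DRAM raise this ceiling, which is the mechanism behind the SRAM-plus-mesh bet.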
3. Software Ecosystem Maturity
- No Tenstorrent equivalent to vLLM with PagedAttention and continuous batching today.
- Production deployments depend on TT-Metal kernels or TT-NN's PyTorch-like layer with the verified model library.
- TT-Forge MLIR compiler is improving rapidly - GitHub claims 800+ model variants tested in CI, with GPT-OSS 120B, Llama 3 70B, Stable Diffusion XL, Whisper, and YOLOv12 all running today from PyTorch, JAX, or ONNX.
- But production-grade serving infrastructure (rate limiting, observability, autoscaling, multi-tenancy) is still less mature than NVIDIA's toolchain.
The Takeaway for Buyers
If you want lowest cost-per-token today on standard workloads
- Tenstorrent is not the right answer yet.
- NVIDIA H100s or B200s on a neo-cloud will be cheaper and easier.
If your workload is MoE-heavy or long-context
- This covers DeepSeek-class and frontier reasoning models, and large-batch inference where data movement dominates.
- It assumes you have hardware procurement authority and can either invest in custom kernel work or rely on Tenstorrent's verified model library.
- Under those conditions, the Galaxy Blackhole numbers are compelling on a TCO basis.
If you are a sovereign / regulated buyer
- Government compute programs, regulated industries, hyperscalers building custom silicon.
- Auditable open hardware is a procurement requirement, not a preference.
- Tenstorrent is uniquely positioned and there is no equivalent open competitor at this maturity level.
Coming Next
In Part 5, we get into the architectural deep dive: side-by-side comparison of NVIDIA, AMD, and Tenstorrent at the technical and strategic level, with a particular focus on why Tenstorrent's architectural choices are well-suited to inference.
Sources: Tenstorrent official documentation (docs.tenstorrent.com), hardware product pages (tenstorrent.com/hardware/wormhole, tenstorrent.com/hardware/blackhole), Tenstorrent newsroom posts (TT-Deploy, Blackhole Developer Products launch), Tom's Hardware coverage of the p150 firmware revision (Feb 2026), VideoCardz coverage, GitHub repositories (tt-metal, tt-forge), "Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking" (ASPLOS 2025), "Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities" (arXiv 2505.06085), "Attention in SRAM on Tenstorrent Grayskull" (arXiv 2407.13885), Spheron Tenstorrent vs NVIDIA comparison (April 2026), wccftech coverage of the Galaxy Blackhole launch.




