Token pricing works like any product business: the higher the tier, the higher the quality and performance, but the lower the volume and capacity. That pattern exists in every industry. What Nvidia has done with the Grace Blackwell architecture is increase the performance of these tiers by 35× and introduce an entirely new tier. That represents a major jump compared with the previous Hopper generation.
At every tier the company increased throughput, and in the most valuable tier — the one with the highest average selling price — it increased performance by 10×. Achieving that is extremely difficult. It comes from technologies such as NVLink 72, extremely low-latency interconnects, and deep hardware–software co-design. These advances allow the entire performance curve to shift upward.
How power gets allocated across tiers
From a customer perspective, imagine distributing the power of a data centre across service tiers. Suppose 25% of the available power runs a free tier, 25% supports a mid-tier service, 25% runs a high tier, and 25% powers a premium tier. A typical large AI data centre might have around one gigawatt of power capacity, so the operator decides how to allocate that power.
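The even four-way split described above can be sketched in a few lines. The tier names and the 25% shares follow the article's simplified scenario; they are illustrative, not real operator figures.

```python
# Illustrative sketch of the power-allocation example: a ~1 GW data
# centre split evenly across four service tiers. The even 25% split is
# the article's simplified scenario, not a real operator's allocation.

TOTAL_POWER_MW = 1000  # ~1 gigawatt, expressed in megawatts

tier_shares = {
    "free": 0.25,
    "mid": 0.25,
    "high": 0.25,
    "premium": 0.25,
}

# Convert each tier's share of the budget into megawatts.
tier_power = {tier: TOTAL_POWER_MW * share for tier, share in tier_shares.items()}

for tier, mw in tier_power.items():
    print(f"{tier:>8}: {mw:.0f} MW")
# Each tier receives 250 MW of the 1,000 MW budget.
```

The operator's real decision is which shares to assign; shifting share toward the premium tier trades user acquisition for revenue per watt.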
The free tier helps attract users, while the premium tier serves the highest-value customers. When you multiply the throughput improvements across all tiers, the result directly translates into revenue. In a simplified example, the Blackwell architecture can generate roughly five times more revenue capacity than earlier systems. The Rubin generation could deliver around five times more again. That is why deploying the Vera Rubin platform quickly becomes important: token costs decrease while throughput increases.
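A toy model makes the revenue arithmetic concrete. The per-tier throughputs and token prices below are invented for illustration; only the roughly 5×-per-generation throughput gain comes from the text.

```python
# Toy model of how per-tier throughput gains translate into revenue
# capacity. Baseline throughputs and prices are invented for
# illustration; only the ~5x generation multiplier comes from the text.

def revenue_capacity(throughputs, prices):
    """Revenue capacity = sum over tiers of (tokens served x price per token)."""
    return sum(t * p for t, p in zip(throughputs, prices))

# Hypothetical baseline generation: tokens/sec per tier and $/Mtoken.
base_throughput = [1000, 800, 500, 200]   # free, mid, high, premium
prices =          [0.0, 1.0, 3.0, 10.0]   # free tier earns nothing directly

baseline = revenue_capacity(base_throughput, prices)

# Next generation: roughly 5x throughput at every tier, same power envelope.
next_gen = revenue_capacity([t * 5 for t in base_throughput], prices)

print(next_gen / baseline)  # → 5.0: uniform tier gains scale revenue directly
```

Because the gain applies at every tier, the multiplier carries through regardless of how the operator weights the tiers.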
The throughput–latency trade-off
There is still a fundamental challenge. High throughput requires enormous floating-point compute performance, while low latency requires extremely high bandwidth. Computer systems struggle to deliver both at the same time because there is only so much physical space on a chip and in a system for compute units and memory bandwidth. Optimising for maximum throughput and optimising for minimum latency are often conflicting goals.
NVLink-based systems like Vera Rubin excel at high-throughput, batch-friendly workloads: they can process huge numbers of tokens across many users when latency per user is less critical. But if you extend the requirements further — say you want to generate 1,000 tokens per second instead of 400 tokens per second for a single stream — eventually NVLink-based systems reach their bandwidth limits. Pushing past that ceiling is where a different kind of processor becomes useful.
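The tension between aggregate throughput and per-stream speed can be illustrated with a simple decode-step model: weight streaming (bandwidth-bound) is shared by the whole batch, while compute scales with batch size. All constants here are invented; this does not model any real system.

```python
# Toy model of the throughput-latency tension. Per decode step, a fixed
# memory-bandwidth cost is amortised across the batch, while compute
# grows with batch size. Constants are invented for illustration only.

COMPUTE_PER_TOKEN_MS = 0.5  # compute time added per concurrent stream
MEMORY_TIME_MS = 1.0        # time to stream weights once per step

def step_time_ms(batch_size):
    # One decode step serves every stream in the batch.
    return MEMORY_TIME_MS + COMPUTE_PER_TOKEN_MS * batch_size

def per_user_tokens_per_sec(batch_size):
    # Each user gets one token per step.
    return 1000.0 / step_time_ms(batch_size)

def aggregate_tokens_per_sec(batch_size):
    return batch_size * per_user_tokens_per_sec(batch_size)

for b in (1, 8, 64):
    print(f"batch {b:>2}: {per_user_tokens_per_sec(b):6.1f} tok/s per user, "
          f"{aggregate_tokens_per_sec(b):7.1f} tok/s aggregate")
# Larger batches raise aggregate throughput but slow each stream; even at
# batch 1, the bandwidth-bound step time caps single-stream token speed.
```

In this toy model, a single stream tops out below 1,000 tokens per second no matter how much compute is added, because the memory-bound step time dominates; only more bandwidth (or a different processor design) raises that ceiling.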
Why tier improvements matter for AI factories
For operators running a gigawatt-scale AI factory, the math is straightforward. If Blackwell delivers roughly five times more revenue capacity than the previous generation, and Rubin delivers another factor of about five, then two architecture cycles can multiply the revenue potential of the same power envelope by roughly 25×, well over an order of magnitude. That does not mean every operator will capture that full gain — competition and pricing will determine how much flows to the bottom line — but it does mean that the factories with the latest stacks have a structural advantage.
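The compounding above is simple to verify, using only the rough factors quoted in the text:

```python
# Compounding the article's rough per-generation revenue-capacity gains.
blackwell_gain = 5   # ~5x revenue capacity vs the previous generation
rubin_gain = 5       # ~5x again on top of Blackwell

two_cycle_gain = blackwell_gain * rubin_gain
print(two_cycle_gain)  # → 25: well over an order of magnitude in the same power envelope
```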
Deploying Vera Rubin quickly is therefore not just a technical choice; it is an economic one. Earlier deployment means earlier access to lower token costs and higher throughput, which in turn supports more aggressive pricing, larger context windows or faster token speeds for premium customers. In a market where tokens are becoming a commodity and tiers are segmenting by price and performance, the factories that can offer the best curve — more throughput at every tier and a credible premium tier at the top — will capture a disproportionate share of high-value workloads.
What this means for the industry
The Grace Blackwell and Rubin story is a reminder that AI infrastructure is not a single product but a layered performance curve. Free tiers, mid tiers, high tiers and premium tiers each consume a slice of the same power budget. The architectures that shift that curve upward — 35× on tier performance, 10× on the highest-value tier, and roughly 5× revenue capacity per generation — are the ones that will define who can afford to run which services at scale. For Nvidia, that is the logic of betting so heavily on NVLink 72, co-design, and the rapid rollout of the Vera Rubin platform.
In short: the same gigawatt that used to support one curve of free-to-premium tiers now supports a steeper curve with higher throughput at every level and a new top tier that was not feasible before. That is why tier economics and hardware roadmaps are inseparable in the AI factory era. Operators who deploy Grace Blackwell and Vera Rubin first will see both lower cost per token and a stronger position in the premium segment where margins are highest.
Sources
- Nvidia GTC keynote on Grace Blackwell and Rubin tier performance (35×, 10× on premium tier), power allocation across tiers, and revenue capacity (5× per generation)
- Nvidia materials on NVLink 72, Vera Rubin deployment and AI factory economics
- Industry analysis of throughput versus latency trade-offs in large-scale inference