In the last year alone, token production has increased nearly a hundredfold. Modern AI systems are essentially token factories, and for the companies running them, everything depends on how efficiently they can generate tokens. The effectiveness, speed and cost of token production now determine the success of their AI infrastructure and, increasingly, of their business models.
In Nvidia CEO Jensen Huang’s framing, data centres are no longer just storage rooms and application hosts. They have become AI factories that take in data and electricity and output tokens. Those tokens power chatbots, coding agents, search assistants, recommendation engines and countless internal tools. The new unit of productivity is not just requests per second but tokens per second delivered at an acceptable cost and latency.
From 700 to nearly 5,000 tokens per second
The economics of this shift show up clearly in token throughput benchmarks. In one example from Nvidia’s GTC narrative, the software stack running on a fixed piece of hardware was optimised so aggressively that token generation speed jumped from around 700 tokens per second to nearly 5,000 tokens per second. That is roughly a seven-fold increase in performance without changing the underlying hardware.
What changed was not the silicon but the hardware–software co-design around it: kernels, compilers, runtime libraries, scheduling strategies and model execution graphs were all tuned so that GPUs spent less time idle and more time turning electricity into tokens. For customers, that kind of improvement effectively cuts the cost per token and increases the practical capacity of their AI factories overnight.
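To see why that matters economically, here is a back-of-envelope sketch in Python. Only the 700 and 5,000 tokens-per-second figures come from the benchmark above; the hourly cost of running the system is a made-up placeholder, held constant because the hardware did not change.

```python
# Back-of-envelope: cost per token before and after a software-only speedup.
# The hourly system cost is a hypothetical placeholder; only the throughput
# figures come from the benchmark described above.

HOURLY_SYSTEM_COST = 100.0      # $/hour to run the same hardware (assumed, unchanged)
TOKENS_PER_SEC_BEFORE = 700
TOKENS_PER_SEC_AFTER = 5_000

def cost_per_million_tokens(tokens_per_sec: float, hourly_cost: float) -> float:
    """Dollars per million tokens at a given steady-state throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

before = cost_per_million_tokens(TOKENS_PER_SEC_BEFORE, HOURLY_SYSTEM_COST)
after = cost_per_million_tokens(TOKENS_PER_SEC_AFTER, HOURLY_SYSTEM_COST)

print(f"Speedup: {TOKENS_PER_SEC_AFTER / TOKENS_PER_SEC_BEFORE:.1f}x")
print(f"Cost before: ${before:.2f} per million tokens")
print(f"Cost after:  ${after:.2f} per million tokens")
```

With these assumed figures, cost per token falls by the same factor as the throughput rises, which is the sense in which a software optimisation "cuts the cost per token overnight".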
Benchmarks from Nvidia and its partners show similar stories elsewhere: smarter memory layouts, mixed-precision arithmetic, speculative decoding, batching across users and GPU partitioning can all dramatically increase tokens-per-second at a given power budget. The absolute numbers vary by model and configuration, but the pattern is consistent: software optimisation is now as central to AI economics as raw hardware performance.
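One way to see why batching in particular helps is a toy latency model of memory-bandwidth-bound decoding: each decode step pays a roughly fixed cost to stream the model weights, plus a small per-sequence cost, so serving more users per step amortises the fixed cost. The timings below are invented for illustration and are not measurements of any real system.

```python
# Toy model: tokens/second vs. batch size when each decode step pays a fixed
# cost (streaming weights) plus a per-sequence cost. All timings are assumed.

FIXED_STEP_MS = 20.0       # assumed cost to read the weights once per decode step
PER_SEQUENCE_MS = 0.5      # assumed extra cost per sequence in the batch

def tokens_per_second(batch_size: int) -> float:
    """Each decode step emits one token per sequence in the batch."""
    step_ms = FIXED_STEP_MS + PER_SEQUENCE_MS * batch_size
    return batch_size / (step_ms / 1000.0)

for batch in (1, 4, 16, 64, 256):
    print(f"batch={batch:4d}  ~{tokens_per_second(batch):8.0f} tokens/s")
```

The curve flattens as the per-sequence cost starts to dominate, which is why batching, precision and partitioning are tuned together rather than in isolation.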
Why tokens are the new commodity
Inference is now the dominant workload in AI. Every time a language model answers a question, writes a paragraph or reasons through a coding task, it consumes and produces tokens. Those tokens are the unit that cloud providers bill for, that startups track in their internal dashboards and that investors increasingly use as a proxy for usage and revenue potential.
In that sense, tokens have become a new commodity of computing. A decade ago, the key metrics were CPU cores, virtual machines or storage capacity. Today, the central question is how many high-quality tokens a given system can produce per second, per dollar and per megawatt. Companies that can secure more efficient token production gain pricing power, higher margins or both.
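Those per-dollar and per-megawatt figures are simple to compute once a system's throughput, power draw and cost are known. The sketch below uses entirely hypothetical inputs purely to show the arithmetic behind the metrics.

```python
# Converting throughput, power and cost into tokens per dollar and per
# megawatt-hour. All input values are hypothetical placeholders.

TOKENS_PER_SEC = 5_000          # sustained system throughput
SYSTEM_POWER_KW = 60.0          # assumed power draw of the system (IT load only)
ELECTRICITY_PER_KWH = 0.08      # assumed $/kWh
AMORTISED_HW_PER_HOUR = 90.0    # assumed $/hour of hardware depreciation

tokens_per_hour = TOKENS_PER_SEC * 3600
energy_cost_per_hour = SYSTEM_POWER_KW * ELECTRICITY_PER_KWH
total_cost_per_hour = energy_cost_per_hour + AMORTISED_HW_PER_HOUR

tokens_per_dollar = tokens_per_hour / total_cost_per_hour
tokens_per_mwh = tokens_per_hour / (SYSTEM_POWER_KW / 1000.0)  # MWh drawn per hour

print(f"Tokens per dollar: {tokens_per_dollar:,.0f}")
print(f"Tokens per megawatt-hour: {tokens_per_mwh:,.0f}")
```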
This is why Nvidia and others emphasise tokens-per-second in their keynote slides and technical blogs. It is not just a performance brag; it is a statement about who can operate AI factories most profitably. For hyperscalers and AI-native companies, a higher tokens-per-watt figure translates directly into the ability to serve more users, support more complex models or lower prices while maintaining margins.
Data centres as power-constrained factories
The token factory metaphor also highlights an uncomfortable constraint: power. Traditional data centres were often network- or storage-limited; AI factories are power-limited. Once a site is built, companies must live within a fixed megawatt or gigawatt envelope negotiated with utilities or backed by dedicated generation.
Within that limit, every architectural decision is about maximising useful token output. Rack design, cooling, GPU density, networking topology and job scheduling are all tuned to keep accelerators as close to 100% utilised as possible without breaching power or thermal caps. In practice, that means replacing or retrofitting legacy racks, adopting liquid cooling and treating power budgets as first-class product constraints rather than afterthoughts.
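To make the power-envelope constraint concrete, a simple sizing sketch: given a fixed site budget and an assumed overhead for cooling and power distribution, how many accelerator systems fit, and what aggregate token output does that imply? Every figure here is an assumption chosen for illustration.

```python
# Sizing a power-constrained AI factory: how many systems fit in the envelope,
# and what token throughput that implies. All parameters are assumptions.

SITE_POWER_MW = 50.0              # fixed envelope negotiated with the utility
PUE = 1.3                         # assumed overhead for cooling and distribution
SYSTEM_POWER_KW = 120.0           # assumed draw per rack-scale system
TOKENS_PER_SEC_PER_SYSTEM = 5_000 # assumed per-system throughput

it_power_kw = SITE_POWER_MW * 1000 / PUE           # power left for the IT load
num_systems = int(it_power_kw // SYSTEM_POWER_KW)  # whole systems that fit
site_tokens_per_sec = num_systems * TOKENS_PER_SEC_PER_SYSTEM

print(f"IT power available: {it_power_kw:,.0f} kW")
print(f"Systems within the envelope: {num_systems}")
print(f"Aggregate throughput: {site_tokens_per_sec:,} tokens/s")
```

Under assumptions like these, a better PUE or a more efficient system design adds token capacity without renegotiating a single watt with the utility.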
This is one reason Nvidia talks about AI factories rather than just GPU clusters. The factory analogy forces operators to think about throughput, yield, uptime and energy efficiency in the same way that a car plant or semiconductor fab would. Downtime in an AI factory is not just lost compute; it is lost token production and therefore lost revenue opportunity.
Why architecture choices now read like factory design
Because tokens drive revenue, architecture choices are increasingly evaluated like industrial engineering trade-offs. Should a company invest in more GPUs, faster networking, larger memory footprints or better storage tiers? The answer depends on where tokens are being bottlenecked. If GPUs are waiting on I/O, storage upgrades may yield more tokens than another rack of accelerators. If interconnect bandwidth is saturated, moving to NVLink-based topologies or higher-bandwidth Ethernet can unlock stalled performance.
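That bottleneck reasoning can be made explicit with a crude marginal-tokens-per-dollar comparison: if accelerators are known to sit idle a given fraction of the time waiting on I/O, removing that stall recovers throughput from hardware already on the floor, which can beat buying more accelerators. The figures below are purely illustrative assumptions, not a real capacity model.

```python
# Crude what-if: remove an I/O bottleneck vs. buy more accelerators.
# Every number here is a hypothetical assumption for illustration.

CURRENT_TOKENS_PER_SEC = 1_000_000   # assumed fleet throughput today
GPU_IDLE_FRACTION = 0.25             # assumed share of time GPUs stall on I/O

STORAGE_UPGRADE_COST = 2_000_000     # assumed $ to remove the I/O bottleneck
EXTRA_RACK_COST = 4_000_000          # assumed $ for another rack of accelerators
EXTRA_RACK_TOKENS_PER_SEC = 150_000  # assumed added throughput from that rack

# Removing the stall lets the existing GPUs produce during time they were idle.
recovered = CURRENT_TOKENS_PER_SEC * GPU_IDLE_FRACTION / (1 - GPU_IDLE_FRACTION)

print(f"Storage upgrade: +{recovered:,.0f} tokens/s "
      f"({recovered / STORAGE_UPGRADE_COST * 1e6:,.0f} tokens/s per $M)")
print(f"Extra rack:      +{EXTRA_RACK_TOKENS_PER_SEC:,.0f} tokens/s "
      f"({EXTRA_RACK_TOKENS_PER_SEC / EXTRA_RACK_COST * 1e6:,.0f} tokens/s per $M)")
```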
Nvidia’s recent platforms, from DGX SuperPODs to Blackwell-based NVL72 systems and the upcoming Vera Rubin platform built around Rubin GPUs, are marketed explicitly as end-to-end AI factories. They combine GPUs, CPUs, networking, storage and orchestration software into systems whose primary purpose is to maximise token throughput per megawatt and per dollar. The company’s message is that enterprises do not just need GPUs; they need factory-grade architectures.
Every enterprise will measure its token factory
Looking forward, Huang argues that every cloud provider, every computer company, every AI vendor and eventually nearly every large enterprise will evaluate the efficiency of its token factory. Intelligence is becoming a core input to products and decisions; in the future that intelligence will increasingly be produced through tokens generated by AI systems rather than manually written software.
That has two implications. First, token economics will become a standard part of boardroom and budget discussions. Leaders will ask not just how many models they run but how efficiently they turn data and power into useful outputs. Second, competitive advantage will depend on securing access to efficient AI factories, whether companies build their own or partner with providers that have already optimised the stack.
The story Nvidia is telling at GTC is that the world has entered an era where tokens are the new compute commodity and AI factories are the plants that produce them. For now, companies with the best combination of hardware, software and power planning are the ones turning tokens into money the fastest. As token production scales further — and as software optimisations squeeze more throughput out of each system — the gap between efficient and inefficient token factories will only grow.
Sources
- Nvidia GTC keynote remarks on AI factories, token factories and the shift from data centres to token-producing infrastructure
- Nvidia technical blogs and benchmarks on tokens-per-second throughput and hardware–software co-design for inference
- Industry analysis of AI factories as power-constrained token plants and the economics of tokens as a new computing commodity