High throughput and low latency place conflicting demands on hardware: throughput wants massive floating-point compute, while low latency wants extremely high memory bandwidth. There is only so much space on a chip for compute and memory. Nvidia’s answer for the highest-value inference tier is to combine two very different architectures: it licensed technology from the team that built Groq processors and integrated it into the system design. The result improves performance on the top tier of inference workloads by about 35×.
If most workloads need very high throughput, a data centre might run entirely on the Vera Rubin architecture. But if part of the workload involves high-value coding tasks or extremely fast token generation, then it can make sense to allocate perhaps 25% of the infrastructure to Groq-based systems, while the rest remains Rubin-based. That combination extends both the performance envelope and the economics of the factory.
Why Groq is different
A Groq processor is a deterministic dataflow processor. It is statically compiled and compiler scheduled: the compiler determines in advance exactly when data arrives and when computation occurs. There is no dynamic scheduling at runtime. The architecture also includes large amounts of on-chip SRAM and is designed specifically for inference workloads — which is exactly the workload that dominates AI factories today.
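The idea of compiler-scheduled execution can be sketched in a few lines. This toy model is purely illustrative (the operation names and latencies are invented, and this is not Groq’s actual instruction format): the "compiler" assigns every operation a fixed start cycle up front, and "execution" just replays that timetable, with no runtime scheduler involved.

```python
# Toy model of compiler-scheduled (static) execution: every operation gets a
# fixed cycle at "compile time", so nothing is scheduled dynamically at runtime.
# Illustrative only -- not Groq's actual ISA or compiler.

def compile_schedule(ops):
    """Assign each op a fixed start cycle based on its declared latency."""
    schedule, cycle = [], 0
    for name, latency in ops:
        schedule.append((cycle, name))
        cycle += latency  # the compiler knows exactly when each result is ready
    return schedule, cycle

def run(schedule):
    """Execution simply replays the precomputed timetable."""
    return [f"cycle {c}: {name}" for c, name in schedule]

ops = [("load_weights", 2), ("matmul", 4), ("activation", 1), ("store", 1)]
schedule, total = compile_schedule(ops)
print(run(schedule))           # identical timetable on every run
print("total cycles:", total)  # 8
```

Because the timetable is fixed at compile time, latency is deterministic: the same input always takes the same number of cycles, which is exactly the property that matters for predictable token generation.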
The difference between the chips is significant. A Groq LP300 chip contains roughly 500 MB of SRAM, while a Vera Rubin GPU can access far larger memory capacity — hundreds of gigabytes for model parameters and context. Large models with trillions of parameters require massive memory and large KV caches during inference. No single LP300 can hold that; the system needs a way to split the work so that each processor does what it does best.
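A quick back-of-envelope calculation shows the scale of the gap. Assuming FP8 weights (1 byte per parameter) and the roughly 500 MB of SRAM per LP300 stated above — and ignoring the KV cache, which only makes things worse — the weights alone of a large model span thousands of chips:

```python
import math

# Back-of-envelope: why no single LP300 can hold a frontier model.
# Assumptions (illustrative): FP8 weights at 1 byte/parameter, ~500 MB of
# SRAM per LP300 as stated in the text; KV-cache space is ignored here.

SRAM_PER_CHIP_MB = 500
BYTES_PER_PARAM = 1  # FP8

def chips_needed(params_billions):
    model_mb = params_billions * 1e9 * BYTES_PER_PARAM / 1e6
    return math.ceil(model_mb / SRAM_PER_CHIP_MB)  # partial chips don't exist

print(chips_needed(70))    # a 70B model needs 140 chips for weights alone
print(chips_needed(1000))  # a 1T model needs 2000 chips
```

The same trillion-parameter model fits comfortably in the hundreds of gigabytes attached to a handful of Rubin GPUs, which is why the split described next makes sense.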
Dynamo: prefill on Rubin, decode on Groq
To solve this, Nvidia introduced a new software layer called Dynamo. Instead of running inference in a single monolithic pipeline, Dynamo reorganises the inference process so that different parts run on different processors. High-throughput tasks run on the Vera Rubin GPUs, while low-latency decoding tasks run on Groq processors.
In practice, the prefill stage — the attention-heavy, context-loading phase that processes the user’s input and fills the KV cache — is handled by the Rubin GPUs, which are strong in large-scale matrix math and have the memory bandwidth to hold huge models and contexts. The decode stage, which is responsible for fast token generation one token at a time, is offloaded to Groq processors. The two systems work together over high-speed Ethernet using specialised low-latency communication modes.
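The division of labour can be sketched as two pools with a KV-cache handoff between them. All class and function names below are hypothetical illustrations, not Dynamo’s actual API; the point is only the shape of the orchestration: prefill runs once over the whole context on the throughput-optimised pool, then the cache moves to the latency-optimised pool, which extends it one token at a time.

```python
# Minimal sketch of disaggregated inference: prefill on one pool (Rubin-like),
# decode on another (Groq-like), with the KV cache handed off in between.
# Hypothetical names throughout -- this is not Dynamo's real interface.

class PrefillPool:
    """Stands in for Rubin GPUs: one attention-heavy pass fills the KV cache."""
    def prefill(self, prompt_tokens):
        return [("kv", t) for t in prompt_tokens]  # placeholder cache entries

class DecodePool:
    """Stands in for Groq chips: generates tokens one at a time from the cache."""
    def decode(self, kv_cache, max_new_tokens):
        out = []
        for i in range(max_new_tokens):
            tok = f"tok{i}"              # placeholder for one sampled token
            kv_cache.append(("kv", tok)) # decode extends the KV cache
            out.append(tok)
        return out

def generate(prompt_tokens, max_new_tokens=4):
    # Orchestration layer: hand the cache from the prefill pool to the decode pool.
    kv = PrefillPool().prefill(prompt_tokens)       # throughput-bound stage
    return DecodePool().decode(kv, max_new_tokens)  # latency-bound stage

print(generate(["the", "cat"], 3))  # ['tok0', 'tok1', 'tok2']
```

In a real deployment the handoff is the expensive step — the KV cache must cross the network — which is why the text emphasises specialised low-latency communication between the two systems.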
This architecture unifies two very different processors: one optimised for high throughput, the other for ultra-low latency. Because the system still requires large memory capacity, many Groq chips are deployed together to pool their SRAM, while Rubin GPUs handle the heavy compute. Running Dynamo, the operating system for AI factories, on top of this hybrid design yields the roughly 35× improvement on the highest-value tier and opens tiers of token-generation performance that were previously not possible.
When NVLink hits its limit
NVLink 72 and Vera Rubin dominate many AI workloads today because they provide an extremely strong architecture for high-throughput environments. But if you extend the requirements — for example, generating 1,000 tokens per second instead of 400 tokens per second per user — NVLink-based systems eventually reach their bandwidth limits. That is where Groq processors help extend the performance range. The 25% Groq / 75% Rubin mix is a way to reserve capacity for those high-value, latency-sensitive streams without rebuilding the whole factory.
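Using the article’s illustrative numbers, a simple routing rule captures the idea: streams whose per-user token-rate target exceeds what the Rubin partition can sustain (~400 tokens/s here) go to the reserved Groq partition. The threshold and the 25% figure come from the text; the function itself is a sketch, not a real scheduler.

```python
# Rough partition routing with the article's illustrative numbers: ~400 tok/s
# per user on the NVLink/Rubin side, faster streams offloaded to the reserved
# Groq partition. A sketch only, not a production scheduler.

GROQ_FRACTION = 0.25  # share of the factory reserved for low-latency decode

def route(stream_tok_per_s, rubin_limit=400):
    """Pick a partition for a stream based on its per-user token-rate target."""
    return "groq" if stream_tok_per_s > rubin_limit else "rubin"

streams = [120, 400, 800, 1000]
assignments = [route(s) for s in streams]
groq_share = assignments.count("groq") / len(assignments)
print(assignments)                     # ['rubin', 'rubin', 'groq', 'groq']
print(groq_share <= GROQ_FRACTION)     # does demand fit the reserved 25%? False here
```

The last check is the operator’s real planning question: if too much traffic demands the low-latency tier, the reserved fraction — not the rule — is what has to change.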
Manufacturing and deployment
The Groq LP300 processor used in these systems is manufactured by Samsung Electronics, which is producing the chips at high volume to support this new generation of AI infrastructure. So the story is not just architectural — it is also about supply. Nvidia does not need to build the LP300 itself; it integrates Samsung-made Groq chips into a system design that is orchestrated by Dynamo and backed by Rubin GPUs for prefill and memory-heavy work.
For operators, the takeaway is that the highest-value tier of inference — fast token generation for coding agents, research tools and premium APIs — can be stretched by combining Rubin’s throughput with Groq’s deterministic, low-latency decode. Deploying Dynamo on top of that hybrid stack is what delivers the 35× improvement and the new tiers that make the premium end of the token market possible. As demand grows for faster token generation and more capable models, the value of combining these two architectures in one factory only increases. Samsung’s volume production of the LP300 ensures that Groq-based capacity can scale alongside Rubin deployments.
Sources
- Nvidia GTC keynote on Groq integration, LP300, Dynamo (prefill on Rubin, decode on Groq), and 35× improvement on the highest-value inference tier
- Nvidia and Groq materials on deterministic dataflow, compiler-scheduled inference and SRAM-based inference processors
- Industry reporting on Samsung manufacturing of Groq LP300 and hybrid inference architectures