NVIDIA Blackwell Dominates AI Inference Benchmarks, Delivering 15x ROI for AI Factories
NVIDIA’s Blackwell architecture has swept the new SemiAnalysis InferenceMAX v1 benchmarks, establishing a new standard for performance and efficiency in artificial intelligence inference – and promising a dramatic return on investment for businesses building AI-powered infrastructure.
The InferenceMAX v1 benchmark, released Monday, is the first independent benchmark to measure the total cost of compute across a diverse range of AI models and real-world scenarios. The results unequivocally demonstrate NVIDIA’s leadership, notably with its GB200 NVL72 system.
Unprecedented Return on Investment
According to the benchmarks, a $5 million investment in an NVIDIA GB200 NVL72 system can generate a remarkable $75 million in token revenue – a 15x return on investment. This figure underscores the evolving economics of AI inference, where efficient compute is paramount.
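The arithmetic behind that headline figure can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes the return is simply token revenue divided by the up-front investment, leaving out power, hosting, and other operating costs.

```python
def revenue_multiple(investment_usd: float, token_revenue_usd: float) -> float:
    """Return token revenue as a multiple of the up-front investment."""
    if investment_usd <= 0:
        raise ValueError("investment must be positive")
    return token_revenue_usd / investment_usd


# Figures quoted in the article: $5 million invested, $75 million in token revenue.
multiple = revenue_multiple(5_000_000, 75_000_000)
print(f"{multiple:.0f}x")  # prints "15x"
```

Changing either input shows how sensitive the multiple is to the revenue assumption. Halving token revenue, for instance, halves the return.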
“Inference is where AI delivers value every day,” said a senior official at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”
Lowering the Cost of AI
Beyond ROI, NVIDIA’s B200 software optimizations are significantly reducing the total cost of ownership for AI deployments. The company has achieved a cost of just two cents per million tokens on gpt-oss, a 5x reduction in cost per token in just two months. This cost reduction is critical as AI models become more complex and generate far more tokens per query.
The Blackwell platform also delivers best-in-class throughput and interactivity, achieving 60,000 tokens per second per GPU and 1,000 tokens per second per user on gpt-oss when utilizing the latest NVIDIA TensorRT-LLM stack.
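To put the cost and throughput figures in perspective, here is a rough back-of-the-envelope sketch. The helper names are hypothetical, and the code assumes the quoted numbers can be applied as flat rates. In practice, peak throughput and lowest cost per token may be measured at different operating points, so the two should not be combined into a single price.

```python
def implied_previous_cost(current_cents: float, reduction_factor: float) -> float:
    """Cost per million tokens before the reduction, in cents."""
    return current_cents * reduction_factor


def workload_cost_usd(tokens: int, cents_per_million: float) -> float:
    """Dollar cost of generating `tokens` tokens at a flat per-million rate."""
    return tokens / 1_000_000 * cents_per_million / 100


def tokens_per_gpu_hour(tokens_per_second: int) -> int:
    """Tokens one GPU produces in an hour at a sustained rate."""
    return tokens_per_second * 3600


# Figures quoted in the article for gpt-oss.
print(implied_previous_cost(2.0, 5.0))        # 10.0 cents per million tokens
print(workload_cost_usd(1_000_000_000, 2.0))  # 20.0 dollars per billion tokens
print(tokens_per_gpu_hour(60_000))            # 216000000 tokens per GPU-hour
```

At the quoted rate, a billion generated tokens would cost about $20. The 5x reduction implies a starting point of roughly ten cents per million tokens two months earlier.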
The Rise of AI Factories and the Need for Efficient Inference
As AI transitions from providing simple, one-shot answers to tackling complex reasoning tasks, the demand for inference is rising rapidly, and so is the importance of its economics. Modern AI requires not just speed, but also efficiency and scalability.
The InferenceMAX v1 benchmark highlights this shift, measuring performance across a wide range of use cases and providing verifiable results for anyone to assess. NVIDIA Blackwell’s full-stack design, encompassing hardware and software, delivers efficiency and value where it matters most: in production. A technical deep dive provides further details on the methodology and charts used to build these performance curves.
The Foundation of Blackwell’s Leadership
Blackwell’s success is rooted in its extreme hardware-software codesign and full-stack architecture. Key features include:
- NVFP4: A low-precision format for efficiency without compromising accuracy.
- Fifth-generation NVIDIA NVLink: Connecting 72 Blackwell GPUs to function as a single, massive GPU.
- NVLink Switch: Enabling high concurrency through advanced tensor, expert, and data parallel attention algorithms.
Continuous software optimization, coupled with an annual hardware cadence, has more than doubled Blackwell’s performance since its initial launch. Open-source inference frameworks, including NVIDIA TensorRT-LLM, NVIDIA Dynamo, SGLang, and vLLM, are optimized for peak performance, supported by a vast ecosystem of hundreds of millions of GPUs, 7 million CUDA developers, and contributions to over 1,000 open-source projects.
The Future of AI: From Pilots to Factories
AI is rapidly evolving from pilot projects to fully fledged “AI factories” – infrastructure capable of manufacturing intelligence by converting data into tokens and real-time decisions. Open benchmarks like InferenceMAX v1 are crucial for informed platform selection, cost-per-token optimization, and service-level agreement planning. NVIDIA’s Think SMART framework guides enterprises through this transition, demonstrating how its full-stack inference platform delivers tangible ROI and transforms performance into profits.
