Nvidia’s Blackwell Ultra Dominates MLPerf Inference

The machine learning field is moving fast, and the yardsticks used to measure its progress are racing to keep up. A case in point: MLPerf, the biannual machine learning competition sometimes termed “the Olympics of AI,” has introduced three new benchmark tests, reflecting new directions in the field.

“Lately, it has been very difficult trying to follow what happens in the field,” says Miro Hodak, AMD engineer and MLPerf Inference working group co-chair. “We see that the models are becoming progressively larger, and in the last two rounds we have introduced the largest models we’ve ever had.”

The chips that tackled these new benchmarks came from the usual suspects: Nvidia, AMD, and Intel. Nvidia topped the charts, introducing its new Blackwell Ultra GPU, packaged in a GB300 rack-scale design. AMD put up a strong performance, introducing its latest MI325X GPUs. Intel proved that one can still do inference on CPUs with its Xeon submissions, but also entered the GPU game with an Intel Arc Pro submission.

New Benchmarks

Last round, MLPerf introduced its largest benchmark yet, a large language model based on Llama3.1-405B. This round, it topped itself yet again, introducing a benchmark based on the Deepseek R1 671B model, which has more than 1.5 times the parameters of the previous largest benchmark.

As a reasoning model, Deepseek R1 goes through several steps of chain-of-thought when approaching a query. This means much more of the computation happens during inference than in normal LLM operation, making this benchmark even more challenging. Reasoning models are claimed to be the most accurate, making them the technique of choice for science, math, and complex programming queries.
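To see why that extra inference compute adds up, consider a back-of-the-envelope sketch in Python. This is not MLPerf code: the token counts are invented for illustration, and it leans on the common rule of thumb that decoding one token costs roughly two floating-point operations per model parameter.

```python
# Back-of-the-envelope sketch (not MLPerf code): decode-time compute scales
# with the number of generated tokens, so a chain-of-thought trace multiplies
# the cost of answering a single query. Token counts are invented assumptions.

PARAMS = 671e9                    # Deepseek R1 parameter count
FLOPS_PER_TOKEN = 2 * PARAMS      # rule of thumb: ~2 FLOPs per parameter per decoded token

direct_answer_tokens = 200        # hypothetical: a plain LLM answers directly
reasoning_trace_tokens = 3_000    # hypothetical: chain-of-thought emitted before the answer

plain_cost = direct_answer_tokens * FLOPS_PER_TOKEN
reasoning_cost = (reasoning_trace_tokens + direct_answer_tokens) * FLOPS_PER_TOKEN

print(f"plain decode:     {plain_cost:.2e} FLOPs")
print(f"reasoning decode: {reasoning_cost:.2e} FLOPs, "
      f"{reasoning_cost / plain_cost:.0f}x the compute for the same query")
```

Under these assumed token counts, the reasoning trace multiplies per-query decode compute sixteenfold, which is why a reasoning benchmark stresses hardware far harder than its parameter count alone suggests.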

In addition to the largest LLM benchmark yet, MLPerf also introduced the smallest, based on Llama3.1-8B. There is growing industry demand for low-latency yet high-accuracy reasoning, explained Taran Iyengar, MLPerf Inference task force chair. Small LLMs can supply this, and are an excellent choice for tasks such as text summarization and edge applications.
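To make the small-model use case concrete, here is a minimal summarization sketch using the Hugging Face transformers library. It is illustrative only, not the MLPerf harness: it assumes a recent transformers release that accepts chat-style pipeline input, approved access to the gated meta-llama/Llama-3.1-8B-Instruct weights, and a placeholder article string.

```python
# Minimal summarization sketch with a small LLM (illustrative, not the MLPerf
# harness). Assumes a recent transformers release and approved access to the
# gated Llama 3.1 weights on the Hugging Face Hub.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
)

article = "..."  # placeholder: the text to summarize
messages = [
    {"role": "user", "content": f"Summarize this in two sentences:\n\n{article}"},
]

out = generator(messages, max_new_tokens=128)
# The pipeline returns the whole chat; the last message is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```

An 8-billion-parameter model like this can fit on a single consumer GPU, which is what makes it plausible for the latency-sensitive and edge scenarios the benchmark targets.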

This brings the total count of LLM-based benchmarks to a confusing four. They include the new, smallest Llama3.1-8B benchmark; a pre-existing Llama2-70B benchmark; the Llama3.1-405B benchmark introduced last round; and the largest, the new Deepseek R1 benchmark. If nothing else, this signals LLMs are not going anywhere.

In addition to the myriad LLMs, this round of MLPerf Inference included a new voice-to-text benchmark, based on Whisper-large-v3. This benchmark is a response to the growing number of voice-enabled applications, be they smart devices or speech-based AI interfaces.
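For a sense of what this benchmark exercises, here is a minimal voice-to-text sketch using the same Whisper-large-v3 checkpoint through the Hugging Face transformers library. It is illustrative, not the MLPerf harness, and “speech.wav” is a placeholder file path.

```python
# Minimal voice-to-text sketch with Whisper-large-v3 (illustrative, not the
# MLPerf harness). Requires ffmpeg on the system for audio decoding.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
)

result = asr("speech.wav")  # placeholder path to an audio file
print(result["text"])
```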

The MLPerf Inference competition has two broad categories: “closed,” which requires using the reference neural network model as-is without modifications, and “open,” where some modifications to the model are allowed. Within those, there are several subcategories related to how the tests are done and in what sort of…

The post “Nvidia’s Blackwell Ultra Dominates MLPerf Inference” by Dina Genkina was published on 09/10/2025 by spectrum.ieee.org