In the latest round of machine learning benchmark results from MLCommons, computers built around Nvidia’s new Blackwell GPU architecture outperformed all others. But AMD’s latest spin on its Instinct GPUs, the MI325X, proved a match for the Nvidia H200, the product it was meant to counter. The comparable results came mostly on tests of Llama2 70B (for 70 billion parameters), one of the smaller-scale large language models. And in an effort to keep up with a rapidly changing AI landscape, MLPerf added three new benchmarks to better reflect where machine learning is headed.
MLPerf benchmarks machine learning systems in an effort to provide an apples-to-apples comparison between computers. Submitters use their own software and hardware, but the underlying neural networks must be the same. There are now a total of 11 benchmarks for servers, with three added this year.
It has been “hard to keep up with the rapid development of the field,” says Miro Hodak, co-chair of MLPerf Inference. ChatGPT appeared only in late 2022, OpenAI unveiled its first large language model (LLM) that can reason through tasks last September, and LLMs have grown exponentially—GPT-3 had 175 billion parameters, while GPT-4 is thought to have nearly 2 trillion. As a result of the breakneck innovation, “we’ve increased the pace of getting new benchmarks into the field,” says Hodak.
The new benchmarks include two LLMs. The popular and relatively compact Llama2 70B is already an established MLPerf benchmark, but the consortium wanted something that mimicked the responsiveness people expect of chatbots today. So the new benchmark, “Llama2-70B Interactive,” tightens the requirements: computers must produce at least 25 tokens per second under any circumstances and cannot take more than 450 milliseconds to begin an answer.
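Those two limits—a minimum generation rate and a maximum time to first token—can be sketched as a simple check. This is an illustrative sketch only; the `ResponseTrace` structure and `meets_interactive_sla` function are hypothetical and not part of any MLPerf harness.

```python
from dataclasses import dataclass

@dataclass
class ResponseTrace:
    """Hypothetical timing record for one chatbot response."""
    first_token_s: float   # time until the first token appears, in seconds
    total_tokens: int      # tokens generated in the full response
    total_time_s: float    # wall-clock time for the full response, in seconds

def meets_interactive_sla(trace: ResponseTrace,
                          min_tokens_per_s: float = 25.0,
                          max_ttft_s: float = 0.450) -> bool:
    """Check a response against the Llama2-70B Interactive limits:
    at least 25 tokens per second, and no more than 450 ms to begin."""
    tokens_per_s = trace.total_tokens / trace.total_time_s
    return trace.first_token_s <= max_ttft_s and tokens_per_s >= min_tokens_per_s

# 400 ms to first token, 500 tokens in 12 s (~41.7 tokens/s): within both limits
print(meets_interactive_sla(ResponseTrace(0.400, 500, 12.0)))  # True
```

A response that starts quickly but then generates slowly—or one that streams fast after a long pause—fails either way, which is the point of testing both numbers.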
Seeing the rise of “agentic AI”—networks that can reason through complex tasks—MLPerf sought to test an LLM with some of the characteristics needed for that. It chose Llama3.1 405B for the job. That LLM has what’s called a wide context window: a measure of how much information—documents, samples of code, etc.—it can take in at once. For Llama3.1 405B, that’s 128,000 tokens, more than 30 times as much as Llama2 70B’s.
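The “more than 30 times” comparison follows from the two models’ context limits. Llama2 70B’s 4,096-token window is background knowledge here, not a figure stated in the article:

```python
# Context windows in tokens (Llama2 70B's 4,096 is an assumption
# from the model's published specs, not from the article itself)
llama2_70b_context = 4_096
llama31_405b_context = 128_000

ratio = llama31_405b_context / llama2_70b_context
print(f"{ratio:.2f}x")  # 31.25x — "more than 30 times as much"
```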
The final new benchmark, RGAT, is what’s called a graph attention network; it classifies information in a network. For example, the dataset used to test RGAT consists of scientific papers, which all have relationships among authors, institutions, and fields of study, making up 2 terabytes of data. RGAT must classify the papers into just under 3,000 topics.
Blackwell, Instinct Results
Nvidia continued its domination of MLPerf benchmarks through its own submissions and those of some 15 partners, such as Dell, Google, and Supermicro. Both its first- and second-generation Hopper architecture GPUs—the H100 and the memory-enhanced H200—made strong…

The post “Nvidia Blackwell Leads AI Inference, AMD Challenges” by Samuel K. Moore was published on 04/02/2025 by spectrum.ieee.org