Blackwell, AMD Instinct, Untethered AI: First Benchmarks

While the dominance of Nvidia GPUs for AI training remains undisputed, we may be seeing early signs that, for AI inference, the competition is gaining on the tech giant, particularly in terms of power efficiency. The sheer performance of Nvidia's new Blackwell chip, however, may be hard to beat.

This morning, MLCommons released the results of its latest AI inferencing competition, MLPerf Inference v4.1. This round included first-time submissions from teams using AMD Instinct accelerators, the latest Google Trillium accelerators, chips from Toronto-based startup Untether AI, as well as a first trial for Nvidia's new Blackwell chip. Two other companies, Cerebras and FuriosaAI, announced new inference chips but did not submit to MLPerf.

Much like an Olympic sport, MLPerf has many categories and subcategories. The one that saw the biggest number of submissions was the “datacenter-closed” category. The closed category (as opposed to open) requires submitters to run inference on a given model as-is, without significant software modification. The data center category tests submitters on bulk processing of queries, as opposed to the edge category, where minimizing latency is the focus.

Within each category, there are 9 different benchmarks, for different types of AI tasks. These include popular use cases such as image generation (think Midjourney) and LLM Q&A (think ChatGPT), as well as equally important but less heralded tasks such as image classification, object detection, and recommendation engines.

This round of the competition included a new benchmark, called Mixture of Experts. This is a growing trend in LLM deployment, where a language model is broken up into several smaller, independent language models, each fine-tuned for a particular task, such as regular conversation, solving math problems, and assisting with coding. The model can direct each query to an appropriate subset of the smaller models, or "experts." This approach allows for less resource use per query, enabling lower cost and higher throughput, says Miroslav Hodak, MLPerf Inference Workgroup Chair and senior member of technical staff at AMD.
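The routing idea described above can be sketched in a few lines. This is a simplified illustration of expert routing, not any vendor's actual implementation; the names, shapes, and top-k choice here are assumptions for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over expert scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

def route_query(query_vec, expert_gates, top_k=2):
    """Score each expert for this query and return the top-k expert indices.

    expert_gates: one learned gating weight vector per expert.
    Only the selected experts actually run the query, which is why
    Mixture of Experts uses fewer resources per query.
    """
    scores = softmax(expert_gates @ query_vec)
    return np.argsort(scores)[::-1][:top_k]

# Toy setup: 4 experts, 8-dimensional query embedding (illustrative only).
rng = np.random.default_rng(0)
gates = rng.normal(size=(4, 8))
query = rng.normal(size=8)
chosen = route_query(query, gates)
```

In a real deployment the gating network is trained jointly with the experts, and only the chosen experts' parameters are activated per token, trading a small routing cost for much lower total compute.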

The winners on each benchmark within the popular datacenter-closed category were still submissions based on Nvidia's H200 GPUs and GH200 superchips, which combine GPUs and CPUs in the same package. However, a closer look at the performance results paints a more complex picture. Some of the submitters used many accelerator chips while others used just one. If we normalize the number of queries per second each submitter was able to handle by the number of accelerators used, and keep only the best-performing submissions for each accelerator type, some interesting details emerge. (It's important to note that this approach ignores the role of CPUs and interconnects.)
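The normalization described above is straightforward to express in code. The numbers below are made up for illustration and are not MLPerf results; the function names and data layout are assumptions for the sketch.

```python
def best_per_accelerator(submissions):
    """Divide each submission's throughput by its accelerator count,
    then keep the best per-accelerator figure for each chip type.

    submissions: iterable of (chip_name, queries_per_second, num_accelerators).
    """
    best = {}
    for chip, qps, n_accel in submissions:
        per_chip = qps / n_accel
        if per_chip > best.get(chip, 0.0):
            best[chip] = per_chip
    return best

# Hypothetical entries: an 8-accelerator system can post a higher raw
# throughput yet a lower per-accelerator figure than a single-chip system.
subs = [
    ("ChipA", 32000, 8),   # 4000 queries/s per accelerator
    ("ChipA", 4100, 1),    # 4100 queries/s per accelerator (wins)
    ("ChipB", 3500, 1),
]
normalized = best_per_accelerator(subs)
```

As the article notes, this kind of normalization ignores the contribution of CPUs and interconnects, so it is a rough comparison of accelerator silicon rather than of whole systems.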

On a per accelerator basis, Nvidia’s Blackwell outperforms all previous chip iterations by 2.5x on the LLM Q&A task, the only benchmark it was…

The post “Blackwell, AMD Instinct, Untethered AI: First Benchmarks” by Dina Genkina was published on 08/28/2024 by spectrum.ieee.org