Since 2018, the consortium MLCommons has been running a sort of Olympics for AI training. The competition, called MLPerf, consists of a set of tasks for training specific AI models, on predefined datasets, to a certain accuracy. Essentially, these tasks, called benchmarks, test how well a given combination of hardware and low-level software is set up to train a particular AI model.
Twice a year, companies put together their submissions—usually, clusters of CPUs and GPUs and software optimized for them—and compete to see whose submission can train the models fastest.
There is no question that since MLPerf’s inception, the cutting-edge hardware for AI training has improved dramatically. Over the years, Nvidia has released four new generations of GPUs that have since become the industry standard (the latest, Nvidia’s Blackwell GPU, is not yet standard but growing in popularity). The companies competing in MLPerf have also been using larger clusters of GPUs to tackle the training tasks.
However, the MLPerf benchmarks have also gotten tougher. And this increased rigor is by design—the benchmarks are trying to keep pace with the industry, says David Kanter, head of MLPerf. “The benchmarks are meant to be representative,” he says.
Intriguingly, the data show that large language models and their precursors have been growing in size faster than hardware performance has improved. So each time a new benchmark is introduced, the fastest training time jumps up. Hardware improvements then gradually bring that time back down, only for the next, larger benchmark to push it up again, and the cycle repeats.