MLCommons Announces Its First Benchmark for AI Safety

One of the management guru Peter Drucker’s most over-quoted turns of phrase is “what gets measured gets improved.” But it’s over-quoted for a reason: It’s true.

Nowhere is it truer than in technology over the past 50 years. Moore’s law—which predicted that the number of transistors on a chip (and hence its compute capacity) would double every 24 months—has become a self-fulfilling prophecy and a north star for an entire ecosystem. Because engineers carefully measured each generation of manufacturing technology for new chips, they could select the techniques that moved them toward the goal of faster and more capable computing. And it worked: Computing power, and more impressively computing power per watt or per dollar, has grown exponentially over the past five decades. The latest smartphones are more powerful than the fastest supercomputers of the year 2000.

Measurement of performance, though, is not limited to chips. All the parts of our computing systems today are benchmarked—that is, compared with similar components in a controlled way and given quantitative scores. These benchmarks help drive innovation.

And we would know.

As leaders in the field of AI, from both industry and academia, we build and deliver the most widely used performance benchmarks for AI systems in the world. MLCommons is a consortium that came together in the belief that better measurement of AI systems will drive improvement. Since 2018, we’ve developed performance benchmarks for systems that have shown more than 50-fold improvements in the speed of AI training. In 2023, we launched our first performance benchmark for large language models (LLMs), measuring the time it takes to train a model to a particular quality level; within five months we saw repeatable results showing nearly threefold improvements in performance. Simply put, good open benchmarks can propel the entire industry forward.
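
To make the idea concrete, here is a minimal sketch of what a “time to target quality” training benchmark measures, in the spirit of the LLM benchmark described above. The train_one_epoch and evaluate callables and the target_quality value are hypothetical placeholders for illustration, not MLCommons’ actual harness.

```python
# Minimal sketch of a time-to-target-quality benchmark loop (illustrative only).
import time

def time_to_quality(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Train until the evaluation metric reaches target_quality; return elapsed seconds."""
    start = time.monotonic()
    for epoch in range(max_epochs):
        train_one_epoch()              # one pass over the training data
        quality = evaluate()           # e.g., validation accuracy on a held-out set
        if quality >= target_quality:  # stop as soon as the quality bar is met
            return time.monotonic() - start
    return None                        # target quality not reached within the budget
```

Reporting the elapsed time to a fixed quality bar, rather than raw throughput, is what lets results from different systems be compared fairly.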

We need benchmarks to drive progress in AI safety

Even as the performance of AI systems has raced ahead, we’ve seen mounting concern about AI safety. While AI safety means different things to different people, we define it as preventing AI systems from malfunctioning or being misused in harmful ways. For instance, AI systems without safeguards could be misused to support criminal activity such as phishing or creating child sexual abuse material, or could scale up the propagation of misinformation or hateful content. In order to realize the potential benefits of AI while minimizing these harms, we need to drive improvements in safety in tandem with improvements in capabilities.

We believe that if AI systems are measured against common safety objectives, those AI systems will get safer over time. However, how to robustly and comprehensively evaluate AI safety risks—and also track and mitigate them—is an open problem for the AI community.

Safety measurement is challenging because of the many different ways that AI models are used and the many aspects that need to be…

The post “MLCommons Announces Its First Benchmark for AI Safety” by MLCommons AI Safety Working Group was published on 04/16/2024 by spectrum.ieee.org