AGI Benchmarks: Tracking Progress Toward AGI Isn’t Easy

Buzzwords in the field of artificial intelligence can be technical: perceptron, convolution, transformer. These refer to specific computing approaches. A recent term sounds more mundane but has revolutionary implications: timeline. Ask someone in AI for their timeline, and they’ll tell you when they expect the arrival of AGI—artificial general intelligence—which is sometimes defined as AI technology that can match the abilities of humans at most tasks. As AI’s sophistication has scaled—thanks to faster computers, better algorithms, and more data—timelines have compressed. The leaders of major AI labs, including OpenAI, Anthropic, and Google DeepMind, have recently said they expect AGI within a few years.

A computer system that thinks like us would enable close collaboration. Both the immediate and long-term impacts of AGI, if achieved, are unclear, but expect to see changes in the economy, scientific discovery, and geopolitics. And if AGI leads to superintelligence, it may even affect humanity’s placement in the predatory pecking order. So it’s imperative that we track the technology’s progress in preparation for such disruption. Benchmarking AI’s capabilities allows us to shape legal regulations, engineering goals, social norms, and business models—and to understand intelligence more broadly.

While benchmarking any intellectual ability is tough, doing so for AGI presents special challenges. That’s in part because people strongly disagree on its definition: Some define AGI by its performance on benchmarks, others by its internal workings, its economic impact, or vibes. So the first step toward measuring the intelligence of AI is agreeing on the general concept.

Another issue is that AI systems have different strengths and weaknesses from humans, so even if we define AGI as “AI that can match humans at most tasks,” we can debate which tasks really count, and which humans set the standard. Direct comparisons are difficult. “We’re building alien beings,” says Geoffrey Hinton, a professor emeritus at the University of Toronto who won a Nobel Prize for his work on AI.

Undaunted, researchers are busy designing and proposing tests that might lend some insight into our future. But a question remains: Can these tests tell us if we’ve achieved the long-sought goal of AGI?

Why It’s So Hard to Test for Intelligence

There are infinite kinds of intelligence, even in humans. IQ tests provide a kind of summary statistic by including a range of semirelated tasks involving memory, logic, spatial processing, mathematics, and vocabulary. Sliced differently, performance on each task relies on a mixture of what’s called fluid intelligence—reasoning on the fly—and crystallized intelligence—applying learned knowledge or skills.

For humans in high-income countries, IQ tests often predict key outcomes, such as academic and career success. But we can’t make the same assumptions about AI, whose abilities aren’t bundled in the…

The post “AGI Benchmarks: Tracking Progress Toward AGI Isn’t Easy” by Matthew Hutson was published on 09/22/2025 by spectrum.ieee.org