AI Math Benchmarks: AI’s Growing Capabilities

Mathematics is often regarded as the ideal domain for measuring AI progress. Math’s step-by-step logic is easy to track, and its definitive, automatically verifiable answers remove human judgment from scoring. But AI systems are improving at such a pace that math benchmarks are struggling to keep up.
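
To make “automatically verifiable” concrete, here is a minimal sketch of how a benchmark harness might grade a model’s final answer against a known ground truth. The check_answer helper and the example expressions are hypothetical illustrations, not FrontierMath’s actual grading code.

```python
# Minimal sketch of automatic answer verification, the property that makes
# math a convenient benchmark domain. The helper and examples below are
# hypothetical illustrations, not FrontierMath's actual grading pipeline.
import sympy

def check_answer(submitted: str, ground_truth: str) -> bool:
    """Return True if the submitted expression is mathematically equal
    to the ground truth, regardless of how it is written."""
    diff = sympy.simplify(sympy.sympify(submitted) - sympy.sympify(ground_truth))
    return diff == 0

# A model might express the same value in different but equivalent forms;
# symbolic comparison accepts all of them with no human in the loop.
print(check_answer("2**10", "1024"))         # True
print(check_answer("sqrt(8)", "2*sqrt(2)"))  # True
print(check_answer("22/7", "pi"))            # False -- close is not equal
```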

Way back in November 2024, non-profit research organization Epoch AI quietly released FrontierMath, a standardized, rigorous benchmark designed to measure the mathematical reasoning capabilities of the latest AI tools.

“It’s a bunch of really hard math problems,” explains Greg Burnham, Epoch AI Senior Researcher. “Originally, it was 300 problems that we now call tiers 1–3, but having seen AI capabilities really speed up, there was a feeling that we had to run to stay ahead, so now there’s a special challenge set of extra carefully constructed problems that we call tier 4.”

To a rough approximation, tiers 1–4 span advanced undergraduate through early postdoc-level mathematics. When FrontierMath was introduced, state-of-the-art AI models were unable to solve more than 2% of its problems. Fast forward to today, and the best publicly available AI models, such as GPT-5.2 and Claude Opus 4.6, are solving over 40% of FrontierMath’s 300 tier 1–3 problems and over 30% of the 50 tier 4 problems.
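
For scale, those rates translate into rough problem counts as follows; this is a back-of-the-envelope restatement of the percentages above, not figures reported by Epoch AI.

```python
# Back-of-the-envelope: turn the reported solve rates into rough problem counts.
tier_1_3_total = 300                      # tier 1-3 problems
tier_4_total = 50                         # tier 4 problems
solved_1_3 = tier_1_3_total * 40 // 100   # "over 40%" -> at least 120 problems
solved_4 = tier_4_total * 30 // 100       # "over 30%" -> at least 15 problems
improvement = 40 / 2                      # vs. ~2% at launch: a 20x jump
print(solved_1_3, solved_4, improvement)  # 120 15 20.0
```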

AI takes on PhD-level mathematics

And this dizzying pace of advancement shows no signs of abating. Just recently, for example, Google DeepMind announced that Aletheia, an experimental AI system derived from Gemini Deep Think, had achieved publishable PhD-level research results. Though mathematically obscure (the work calculates certain structure constants in arithmetic geometry called eigenweights), the result is significant in terms of AI development.

“They’re claiming it was essentially autonomous, meaning a human wasn’t guiding the work, and it’s publishable,” Burnham says. “It’s definitely at the lower end of the spectrum of work that would get a mathematician excited, but it’s new—it’s something we truly haven’t really seen before.”

To place this achievement in context, every FrontierMath problem has a known answer that a human has derived. Though a human could probably have achieved Aletheia’s result “if they sat down and steeled themselves for a week,” says Burnham, no human had ever done so.

Aletheia’s results and other recent achievements by AI mathematicians point to the need for new, tougher benchmarks to understand AI capabilities, and fast, because existing ones will soon be irrelevant. “There are easier math benchmarks that are already obsolete, several generations of them,” says Burnham. “FrontierMath will probably saturate [meaning state-of-the-art AI models score 100%] within the next two years; could be faster.”

The First Proof challenge

To begin to address this problem, on February 6, a group of 11 highly distinguished mathematicians proposed the First…
