AI-Powered Proof Generator Helps Debug Software

Not all software is perfect—many apps, programs, and websites are released despite bugs. But the software behind critical systems like cryptographic protocols, medical devices, and space shuttles must be error-free, and ensuring the absence of bugs requires going beyond code reviews and testing. It requires formal verification.

Formal verification involves writing a mathematical proof of your code and is “one of the hardest but also most powerful ways of making sure your code is correct,” says Yuriy Brun, a professorat the University of Massachusetts Amherst.

To make formal verification easier, Brun and his colleagues devised a new AI-powered method called Baldur to automatically generate proofs. The accompanying paper, presented in December 2023 at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering in San Francisco, won a Distinguished Paper award. The team includes Emily First, who completed the study as part of her doctoral dissertation at UMass Amherst; Markus Rabe, a former researcher at Google, where the study was conducted; and Talia Ringer, an assistant professor at the University of Illinois Urbana-Champaign.

This flowchart shows at a high level how Baldur generates a proof. University of Massachusetts/ University of Illinois Urbana-Champaign

Baldur is powered by Google’s Minerva large language model (LLM)—which was trained on scientific papers and Web pages containing mathematical expressions—and fine-tuned on data about proofs and theorems. Baldur works with Isabelle, a proof assistant or automated theorem prover,to check its proofs. When given a theorem statement, Baldur is able to generate an entire proof almost 41 percent of the time.

To boost Baldur’s success, the team fed the model additional contextual information—such as other definitions or the lines preceding the theorem statement in a theory file—and found that the proof rate increased to 47.5 percent. This means that Baldur is able to take the context and use it to predict a new correct proof, says First. Similar to a programmer who may be better equipped to fix a bug in a method when they know how that method relates to its surrounding code and the other methods in the same class, Baldur can perform better with extra context.

Thor, the current state-of-the-art tool for automatic proof generation, has a higher proof rate at 57 percent. Baldur’s advantage lies in its ability to generate whole proofs; Thor predicts the next step in a proof using a smaller language model combined with a method that searches the space of possible proofs. But when Thor and Baldur (Thor’s brother in Norse mythology) work together, the pair can generate correct proofs nearly 66 percent of the time.

The team also discovered that Baldur can “repair” its own proofs, further improving its proof rate. When provided with its previous failed attempt plus the error message returned by Isabelle, Baldur can turn its…

Read full article: AI-Powered Proof Generator Helps Debug Software

The post “AI-Powered Proof Generator Helps Debug Software” by Rina Diane Caballar was published on 01/25/2024 by spectrum.ieee.org