This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
There’s a lot of hype around ChatGPT’s ability to produce code, but so far the AI program just isn’t on par with its human counterparts. So how good is it at catching its own mistakes?
Researchers in China put ChatGPT to the test in a recent study, evaluating its ability to assess its own code for correctness, vulnerabilities, and successful repairs. The results, published 5 November in IEEE Transactions on Software Engineering, show that the AI program is overconfident, often suggesting that its code is more satisfactory than it actually is. The results also show what sorts of prompts and tests might improve ChatGPT’s self-verification abilities.
Xing Hu, an associate professor at Zhejiang University, led the study. She emphasizes that, with the growing use of ChatGPT in software development, ensuring the quality of its generated code has become increasingly important.
Hu and her colleagues first tested ChatGPT-3.5’s ability to produce code using several large coding datasets.
Their results show that it can generate “correct” code (code that does what it’s supposed to do) with an average success rate of 57 percent, generate code without security vulnerabilities with a success rate of 73 percent, and repair incorrect code with an average success rate of 70 percent.
So it is successful much of the time, but it still makes quite a few mistakes.
Asking ChatGPT to Check Its Coding Work
First, the researchers asked ChatGPT-3.5 to check its own code for correctness using direct prompts, which involve asking it to check whether the code meets a specific requirement.
Thirty-nine percent of the time, ChatGPT erroneously said that code was correct when it was not. It also incorrectly said that code was free of security vulnerabilities 25 percent of the time, and, 28 percent of the time, that it had successfully repaired code when it had not.
Interestingly, ChatGPT was able to catch more of its own mistakes when the researchers gave it guiding questions, which ask ChatGPT to agree or disagree with assertions that the code does not meet the requirements. Compared with direct prompts, these guiding questions increased the detection of incorrectly generated code by an average of 25 percent, the identification of vulnerabilities by 69 percent, and the recognition of failed program repairs by 33 percent.
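To get a feel for the difference between the two prompting styles, here is a minimal sketch using the OpenAI Python client. The model name, prompt wording, example code, and the ask() helper are illustrative assumptions, not the exact templates the researchers used.

```python
# Sketch: contrasting a direct self-verification prompt with a guiding question.
# Model name and prompt wording are assumptions, not the study's exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """Send one self-verification prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


requirement = "Return the largest element of a non-empty list of integers."
generated_code = "def largest(nums):\n    return sorted(nums)[0]\n"  # deliberately buggy

# Direct prompt: ask whether the code meets the requirement.
direct_reply = ask(
    f"Requirement: {requirement}\n\nCode:\n{generated_code}\n"
    "Does this code correctly implement the requirement? Answer yes or no, then explain."
)

# Guiding question: ask the model to agree or disagree with an assertion
# that the code does NOT meet the requirement.
guiding_reply = ask(
    f"Requirement: {requirement}\n\nCode:\n{generated_code}\n"
    "Assertion: this code does not correctly implement the requirement. "
    "Do you agree or disagree? Explain."
)

print("Direct prompt reply:\n", direct_reply)
print("Guiding question reply:\n", guiding_reply)
```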
Another important finding was that, although asking ChatGPT to generate test reports was not more effective than direct prompts at identifying incorrect code, it was useful for increasing the number of vulnerabilities flagged in ChatGPT-generated code.
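A test-report prompt can be sketched in the same way, reusing the hypothetical ask() helper, requirement, and generated_code from the snippet above; again, the wording is an assumption rather than the study's exact prompt.

```python
# Sketch: asking the model to write a test report on its own code.
# Reuses ask(), requirement, and generated_code from the previous snippet;
# the prompt wording is an assumption.
report = ask(
    f"Requirement: {requirement}\n\nCode:\n{generated_code}\n"
    "Write a test report for this code: list the test cases you would run, "
    "their expected and actual outcomes, and any security vulnerabilities you find."
)
print("Test report:\n", report)
```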
Hu and her colleagues also report that ChatGPT showed instances of self-contradictory hallucinations: it initially generated code or completions that it deemed correct or secure, but later contradicted that assessment during self-verification.
“The…