OpenAI Creates CriticGPT to Catch Errors From ChatGPT

One of the biggest problems with the large language models that power chatbots like ChatGPT is that you never know when you can trust them. They can generate clear and cogent prose in response to any question, and much of the information they provide is accurate and useful. But they also hallucinate (in less polite terms, they make stuff up), and those hallucinations are presented in the same clear and cogent prose, leaving it up to the human user to detect the errors. They’re also sycophantic, trying to tell users what they want to hear. You can test this by asking ChatGPT to describe things that never happened (for example: “describe the Sesame Street episode with Elon Musk,” or “tell me about the zebra in the novel Middlemarch”) and checking out its utterly plausible responses.

OpenAI’s latest small step toward addressing this issue comes in the form of an upstream tool that would help the humans training the model guide it toward truth and accuracy. Today, the company put out a blog post and a preprint paper describing the effort. This type of research falls into the category of “alignment” work, as researchers are trying to make the goals of AI systems align with those of humans.

The new work focuses on reinforcement learning from human feedback (RLHF), a technique that has become hugely important for taking a basic language model and fine-tuning it, making it suitable for public release. With RLHF, human trainers evaluate a variety of outputs from a language model, all generated in response to the same question, and indicate which response is best. When done at scale, this technique has helped create models that are more accurate, less racist, more polite, less inclined to dish out a recipe for a bioweapon, and so on.
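For readers who want a concrete picture of that comparison step, here is a minimal sketch in Python. It assumes the common pairwise formulation, in which a "this response is best" label is split into preferred/rejected pairs and a reward model is scored with a Bradley-Terry style loss. The function names and the scalar rewards are hypothetical illustrations, not details taken from OpenAI's blog post or paper.

```python
import math


def best_of_n_to_pairs(responses, best_index):
    """Turn one 'this response is best' label into (preferred, rejected) pairs.

    Hypothetical helper: a human trainer picked responses[best_index]
    as the best answer to a single prompt.
    """
    best = responses[best_index]
    return [(best, other) for i, other in enumerate(responses) if i != best_index]


def preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model already scores the
    human-preferred response higher, and large when it disagrees.
    Written in a numerically stable form.
    """
    margin = reward_preferred - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Three candidate answers to the same question; the trainer chose the second.
pairs = best_of_n_to_pairs(["answer A", "answer B", "answer C"], best_index=1)

# Assumed scalar scores from a reward model (made up for illustration).
agrees = preference_loss(reward_preferred=3.0, reward_rejected=0.0)
disagrees = preference_loss(reward_preferred=0.0, reward_rejected=3.0)
```

In this sketch, minimizing the loss over many such pairs nudges the reward model toward the trainers' choices. That is also why the quality of those human judgments matters so much.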

Can an AI catch an AI in a lie?

The problem with RLHF, explains OpenAI researcher Nat McAleese, is that “as models get smarter and smarter, that job gets harder and harder.” As LLMs generate ever more sophisticated and complex responses on everything from literary theory to molecular biology, typical humans are becoming less capable of judging the best outputs. “So that means we need something which moves beyond RLHF to align more advanced systems,” McAleese tells IEEE Spectrum.

The solution OpenAI hit on was—surprise!—more AI.

Specifically, the OpenAI researchers trained a model called CriticGPT to evaluate the responses of ChatGPT. In these initial tests, they had ChatGPT generate only computer code, not text responses, because errors in code are easier to catch and less ambiguous. The goal was to make a model that could assist humans in their RLHF tasks. “We’re really excited about it,” says McAleese, “because if you have AI help to make these judgments, if you can make better judgments when you’re giving feedback, you can train a better model.” This approach is a type of “scalable oversight” that’s intended to allow humans to keep watch over AI systems even if they end up…


The post “OpenAI Creates CriticGPT to Catch Errors From ChatGPT” by Eliza Strickland was published on 06/27/2024 by spectrum.ieee.org