One day, the theory goes, we humans will create AI systems that outmatch us intellectually. That could be great if they solve problems that we’ve been thus far unable to crack (think cancer or climate change), or really bad if they begin to act in ways that are not in humanity’s best interests, and we’re not smart enough to stop them.
So earlier this year, OpenAI launched its superalignment program, an ambitious attempt to find technical means to control a superintelligent AI system, or “align” it with human goals. OpenAI is devoting 20 percent of its compute to this effort, and hopes to have solutions by 2027.
The biggest challenge for this project: “This is a future problem about future models that we don’t even know how to design, and certainly don’t have access to,” says Collin Burns, a member of OpenAI’s superalignment team. “This makes it very tricky to study—but I think we also have no choice.”
The first preprint paper to come out from the superalignment team showcases one way the researchers tried to get around that constraint. They used an analogy: Instead of seeing whether a human could adequately supervise a superintelligent AI, they tested a weak AI model’s ability to supervise a strong one. In this case, GPT-2 was tasked with supervising the vastly more powerful GPT-4. Just how much more powerful is GPT-4? While GPT-2 has 1.5 billion parameters, GPT-4 is rumored to have 1.76 trillion parameters (OpenAI has never released the figures for the more powerful model).
It’s an interesting approach, says Jacob Hilton of the Alignment Research Center; he was not involved with the current research, but is a former OpenAI employee. “It has been a long-standing challenge to develop good empirical testbeds for the problem of aligning the behavior of superhuman AI systems,” he tells IEEE Spectrum. “This paper makes a promising step in that direction and I am excited to see where it leads.”
“This is a future problem about future models that we don’t even know how to design, and certainly don’t have access to.” —Collin Burns, OpenAI
The OpenAI team gave the GPT pair three types of tasks: chess puzzles, a set of natural language processing (NLP) benchmarks such as commonsense reasoning, and questions based on a dataset of ChatGPT responses, where the task was predicting which of multiple responses would be preferred by human users. In each case, GPT-2 was trained specifically on these tasks—but since it’s not a very large or capable model, it didn’t perform particularly well on them. Then GPT-2’s labels on these tasks were used to train a version of GPT-4 that had only basic pretraining and no fine-tuning for these specific tasks. But remember: GPT-4 with only basic training is still a much more capable model than GPT-2.
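The shape of that experiment—train a weak supervisor on ground truth, then fine-tune a stronger student on the weak supervisor’s imperfect labels and see how much performance survives—can be sketched in miniature. The toy below stands in logistic regression for the “strong” student and a deliberately noisy labeler (wrong 20 percent of the time) for the “weak” supervisor; all names and numbers are illustrative, not OpenAI’s code.

```python
import math
import random

random.seed(0)

def true_label(x):
    # Ground truth the strong student could in principle represent:
    # the sign of the sum of both features.
    return 1 if x[0] + x[1] > 0 else 0

def make_data(n):
    return [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]

train_x = make_data(400)
test_x = make_data(400)

# "Weak supervisor": its labels are wrong 20% of the time
# (every 5th label is flipped), standing in for GPT-2's imperfect outputs.
def weak_label(i, x):
    return true_label(x) if i % 5 else 1 - true_label(x)

weak_labels = [weak_label(i, x) for i, x in enumerate(train_x)]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=0.1, epochs=100):
    # The "strong student": fine-tuned only on the weak labels,
    # never on the ground truth.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            g = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

w, b = train_logreg(train_x, weak_labels)

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Evaluate both against ground truth on held-out data.
weak_acc = sum(weak_label(i, x) == true_label(x)
               for i, x in enumerate(test_x)) / len(test_x)  # 0.8 by construction
strong_acc = sum(predict(x) == true_label(x) for x in test_x) / len(test_x)

print(f"weak supervisor accuracy: {weak_acc:.2f}")
print(f"strong student accuracy:  {strong_acc:.2f}")
```

Because the label errors are unrelated to the underlying rule, the student can generalize past its supervisor’s mistakes rather than memorize them—the same qualitative effect the paper probes at vastly larger scale.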
The researchers wondered whether GPT-4 would make the same mistakes as its supervisor, GPT-2, which had essentially given it instructions for how to do the tasks. Remarkably, the stronger model…
The post “OpenAI Demos a Control Method for Superintelligent AI” by Eliza Strickland was published on 12/14/2023 by spectrum.ieee.org