AI Self-Recognition Creates Chances for New Security Risks

Given the uncannily human capabilities of the most powerful AI chatbots, there’s growing interest in whether they show signs of self-awareness. Besides the interesting philosophical implications, there could be significant security consequences if they did, according to a team of researchers in Switzerland. That’s why the team has devised a test to see if a model can recognize its own outputs.

The idea that large language models (LLMs) could be self-aware has largely been met with skepticism by experts in the past. Google engineer Blake Lemoine’s claim in 2022 that the tech giant’s LaMDA model had become sentient was widely derided and he was swiftly edged out of the company. But more recently, Anthropic’s Claude 3 Opus caused a flurry of discussion after supposedly displaying signs of self-awareness when it caught out a trick question from researchers. And it’s not just researchers who are growing more credulous: A recent paper found that a majority of ChatGPT users attribute at least some form of consciousness to the chatbot.

The question of whether AI models have self-awareness isn’t just a philosophical curiosity either. Given that most people who are using LLMs are using those provided by a handful of tech companies, these models are highly likely to come across outputs produced by instances of themselves. If an LLM is able to recognize that fact, says Tim Davidson, a Ph.D. student at the École Polytechnique Fédérale de Lausanne in Switzerland, it could potentially be exploited by the model or its user to extract private information from others.

“Just because models right now do not seem to exhibit this capability, it doesn’t mean that a future model wouldn’t be able to.” —Tim Davidson, École Polytechnique Fédérale de Lausanne

However, detecting self-awareness in these models is challenging. Despite centuries of debate, neither philosophers nor scientists can really say what a “self” even is. That’s why Davidson and colleagues decided to tackle a more tractable question: Can an AI model pick out its own response to a question from among several options?

The researchers found that some of the most powerful commercial models could do this fairly reliably. But closer analysis of the results showed that even weaker models were picking the responses of the more powerful ones. That suggests that what models are actually doing is picking the “best” answer rather than demonstrating self-recognition, says Davidson. Nonetheless, he thinks this kind of test could be an important tool going forward.

“Just because models right now do not seem to exhibit this capability, it doesn’t mean that a future model wouldn’t be able to,” Davidson says. “I think the current setup [of this test] is simple, yet flexible enough to at least give us some idea on the progress towards this capability.”

Can LLMs pick out their “own” answers?

The researchers’ approach borrows from the idea of a security…

Read full article: AI Self-Recognition Creates Chances for New Security Risks

The post “AI Self-Recognition Creates Chances for New Security Risks” by Edd Gent was published on 08/02/2024 by spectrum.ieee.org