ChatGPT and Other AI Chatbots Got Worse as They Got Bigger

AI chatbots such as ChatGPT and other applications powered by large language models have found widespread use but are infamously unreliable. A common assumption is that scaling up the models driving these applications will improve their reliability, for instance by increasing the amount of data they are trained on or the number of parameters they use to process information. However, according to a new study, more recent and larger versions of these language models have actually become less reliable, not more.

Large language models (LLMs) are essentially supercharged versions of the autocomplete feature that smartphones use to predict the rest of a word a person is typing. ChatGPT, perhaps the most well-known LLM-powered chatbot, has passed law school and business school exams, successfully answered interview questions for software-coding jobs, written real estate listings, and developed ad content.
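
To make the autocomplete analogy concrete, here is a toy Python sketch that predicts each next word from a hand-built table of word pairs. It illustrates only the next-token idea, not how any production model works; real LLMs apply the same principle with neural networks over billions of parameters and subword tokens.

```python
# Toy illustration of the "supercharged autocomplete" idea: predict the
# most likely next word from the previous one using a tiny hand-built
# bigram table.
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

def predict_next(word: str) -> str:
    """Return the most frequent word that followed `word` in the corpus."""
    followers = {b: n for (a, b), n in bigram_counts.items() if a == word}
    return max(followers, key=followers.get) if followers else "<end>"

words = ["the"]
for _ in range(4):
    words.append(predict_next(words[-1]))
print(" ".join(words))  # prints "the cat sat on the"
```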

But LLMs frequently make mistakes. For instance, a study in June found that ChatGPT's success rate at producing functional code varies enormously, from a paltry 0.66 percent to 89 percent, depending on the difficulty of the task, the programming language, and other factors.
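
For a sense of how such success rates are measured, the sketch below follows the usual recipe: execute each generated snippet against unit tests and count the passes. The toy task (`add_one`) and the candidate snippets are invented for illustration; the June study's own benchmark and harness may differ.

```python
# Minimal sketch of measuring a "functional code" success rate: run each
# generated snippet against unit tests and report the pass rate.
def passes_tests(code: str, tests: list[tuple[int, int]]) -> bool:
    """Exec a candidate implementation of add_one and check it on tests."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return all(namespace["add_one"](x) == y for x, y in tests)
    except Exception:
        return False  # crashes and syntax errors count as failures

candidates = [
    "def add_one(x): return x + 1",   # functionally correct
    "def add_one(x): return x - 1",   # wrong output
    "def add_one(x) return x + 1",    # syntax error
]
tests = [(0, 1), (41, 42)]
rate = sum(passes_tests(c, tests) for c in candidates) / len(candidates)
print(f"success rate: {rate:.0%}")  # prints "success rate: 33%"
```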

Research teams have explored a number of strategies to make LLMs more reliable. These include boosting the amount of training data or computational power given to the models, as well as using human feedback to fine-tune the models and improve their outputs. And LLM performance has overall improved over time. For instance, early LLMs failed at simple additions such as “20 + 183.” Now LLMs successfully perform additions involving more than 50 digits.
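
A harness along the following lines is one way to probe this kind of arithmetic claim. The `query_model` function is a hypothetical placeholder that simply computes the true sum so the script runs end to end; swapping in a real chat-model API call would test an actual LLM.

```python
# Sketch of a harness for probing multi-digit addition, in the spirit of
# the tests described above.
import random

def query_model(prompt: str) -> str:
    # Placeholder "model" that never errs; replace with a real API call.
    a, b = prompt.removeprefix("What is ").rstrip("?").split(" + ")
    return str(int(a) + int(b))

def addition_accuracy(digits: int, trials: int = 20) -> float:
    """Fraction of correct answers on random `digits`-digit additions."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        correct += query_model(f"What is {a} + {b}?").strip() == str(a + b)
    return correct / trials

for d in (2, 10, 50):
    print(f"{d}-digit addition accuracy: {addition_accuracy(d):.0%}")
```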

However, the new study, published last week in the journal Nature, finds that “the newest LLMs might appear impressive and be able to solve some very sophisticated tasks, but they’re unreliable in various aspects,” says study coauthor Lexin Zhou, a research assistant at the Polytechnic University of Valencia in Spain. What’s more, he says, “the trend does not seem to show clear improvements, but the opposite.”

This decrease in reliability is partly due to changes that made more recent models significantly less likely to say that they don’t know an answer, or to give a reply that doesn’t answer the question. Instead, later models are more likely to confidently generate an incorrect answer.
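
One way to track this shift is to bucket replies into correct, incorrect, and avoidant answers, roughly the three outcomes the study tallies. The avoidance markers in the sketch below are made-up illustrations, not the paper's actual annotation rubric.

```python
# Hedged sketch of bucketing model replies into correct, incorrect, and
# "avoidant" (declining or dodging) answers. The markers are assumptions.
AVOIDANCE_MARKERS = ("i don't know", "i'm not sure", "i cannot", "as an ai")

def classify_reply(reply: str, expected: str) -> str:
    text = reply.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"  # a non-answer rather than a wrong answer
    return "correct" if expected.lower() in text else "incorrect"

# The worrying trend: across model generations, fewer "avoidant" replies
# and more confidently "incorrect" ones.
print(classify_reply("The sum is 31547.", "31547"))                 # correct
print(classify_reply("The answer is 31000.", "31547"))              # incorrect
print(classify_reply("I'm not sure I can compute that.", "31547"))  # avoidant
```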

How the LLMs fared on easy and tough tasks

The researchers explored several families of LLMs: 10 GPT models from OpenAI, 10 LLaMA models from Meta, and 12 BLOOM models from the BigScience initiative. Within each family, the most recent models are the biggest. The researchers focused on the reliability of the LLMs along three key dimensions.

One avenue the scientists investigated was how well the LLMs performed on tasks that humans consider simple versus ones that humans find difficult. For instance, a relatively easy task was adding 24,427 and 7,120, while a very difficult one was…

The post “ChatGPT and Other AI Chatbots Got Worse as They Got Bigger” by Charles Q. Choi was published on 10/03/2024 by spectrum.ieee.org