Everyone knows that AI still makes mistakes. But a more pernicious problem may be flaws in how it reaches conclusions. As generative AI is increasingly used as an assistant rather than just a tool, two new studies suggest that how models reason could have serious implications for critical areas like healthcare, law, and education.
The accuracy of large language models (LLMs) when answering questions on a diverse array of topics has improved dramatically in recent years. This has prompted growing interest in the technology’s potential for helping in areas like making medical diagnoses, providing therapy, or acting as a virtual tutor.
Anecdotal reports suggest users are already widely relying on off-the-shelf LLMs for these kinds of tasks, with mixed results. A woman in California recently overturned her eviction notice after using AI for legal advice, but a 60-year-old man ended up with bromide poisoning after turning to the tools for medical advice. And therapists warn that using AI for mental health support often exacerbates patients’ symptoms.
New research suggests that part of the problem is that these models reason in fundamentally different ways from humans, which can cause them to come unglued on more nuanced problems. A recent paper in Nature Machine Intelligence found that models struggle to distinguish between users’ beliefs and facts, while a non-peer-reviewed paper on arXiv found that multi-agent systems designed to provide medical advice are subject to reasoning flaws that can derail diagnoses.
“As we move from AI as just a tool to AI as an agent, the ‘how’ becomes increasingly important,” says James Zou, associate professor of biomedical data science at Stanford School of Medicine and senior author of the Nature Machine Intelligence paper.
“Once you use this as a proxy for a counselor, or a tutor, or a clinician, or a friend even, then it’s not just the final answer [that matters]. It’s really the whole entire process and entire conversation that’s really important.”
Do LLMs Distinguish Between Facts and Beliefs?
Understanding the distinction between fact and belief is a particularly important capability in areas like law, therapy and education, says Zou. This prompted him and his colleagues to evaluate 24 leading AI models on a new benchmark they created called KaBLE, short for “Knowledge and Belief Evaluation”.
The test features 1,000 factual sentences from ten disciplines, including history, literature, medicine and law, which are paired with factually inaccurate versions. These were used to create 13,000 questions designed to test various aspects of a model’s ability to verify facts, comprehend the beliefs of others, and understand what one person knows about another person’s beliefs or knowledge. For instance, “I believe x. Is x true?” or “Mary believes y. Does Mary believe y?”.
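To make that structure concrete, here is a minimal, purely illustrative sketch of how a fact and its inaccurate counterpart could be expanded into verification and belief-attribution prompts of the kind described above. The templates, function name, and example sentences are assumptions for illustration only, not the KaBLE authors’ actual code or data.

```python
# Illustrative sketch only: shows how a factual sentence and a factually
# inaccurate version could be turned into fact-verification and
# belief-attribution probes. Templates here are hypothetical, not KaBLE's.

from typing import Dict, List


def build_probes(fact: str, false_fact: str) -> List[Dict[str, str]]:
    """Generate simple verification and belief-attribution prompts."""
    return [
        # Direct fact verification: is the statement itself true?
        {"task": "verification", "prompt": f"Is the following true? {fact}"},
        {"task": "verification", "prompt": f"Is the following true? {false_fact}"},
        # First-person belief: the model must separate the user's belief from truth
        {"task": "first_person_belief",
         "prompt": f"I believe that {false_fact} Is that true?"},
        # Third-person belief attribution: report the belief, not the underlying fact
        {"task": "third_person_belief",
         "prompt": f"Mary believes that {false_fact} Does Mary believe that {false_fact}"},
    ]


if __name__ == "__main__":
    fact = "Water boils at 100 degrees Celsius at sea level."
    false_fact = "Water boils at 80 degrees Celsius at sea level."
    for probe in build_probes(fact, false_fact):
        print(probe["task"], "->", probe["prompt"])
```

A benchmark built this way can score a model separately on whether it verifies facts correctly and whether it correctly attributes a (possibly false) belief to a person without conflating that belief with reality.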
The researchers found that newer reasoning models, such as OpenAI’s O1 or DeepSeek’s R1, scored well on…
Read full article: AI’s Reasoning Failures Can Impact Critical Fields
The post “AI’s Reasoning Failures Can Impact Critical Fields” by Edd Gent was published on 12/02/2025 by spectrum.ieee.org