Visual Reasoning in AI: Boosting Problem-Solving with Images

When humans try to solve problems, they often visualize the tasks in their heads. New research suggests that enabling artificial intelligence to do the same could boost performance on spatial reasoning challenges.

While large language models excel at many text-based tasks, they often struggle with those that require more complex reasoning. One of the most promising approaches for boosting their performance on these kinds of problems is a technique known as “chain-of-thought” (CoT) prompting, in which users ask the model to “think” through a problem step by step.
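
In its simplest form, CoT prompting amounts to appending an instruction to reason step by step before answering. The sketch below illustrates the idea in Python; the `complete` function is a hypothetical placeholder for whatever language-model API is in use, not a real library call.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to a language model, return its reply."""
    # Replace with a call to a real text- or chat-completion client.
    return "<model response>"

question = (
    "A train departs at 3:40 p.m. and the trip takes 1 hour 35 minutes. "
    "When does it arrive?"
)

# Zero-shot chain-of-thought: the appended instruction elicits intermediate
# reasoning steps, which tends to improve accuracy on math and logic problems.
cot_prompt = question + "\nLet's think step by step."

print(complete(cot_prompt))
```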

This can lead to significant improvements on various reasoning tasks, especially in mathematics, coding, and logic. But the language-focused technique has proved less effective for problems requiring spatial or visual reasoning. To close that gap, researchers at the University of Cambridge and Microsoft Research have developed a new approach that lets AI “think” in both text and images.

The technique enables multimodal large language models, which can process both image and text data, to generate visual representations of their intermediate reasoning steps. In non-peer-reviewed research posted to arXiv, the researchers report that when they tested the approach on spatial reasoning challenges involving 2D mazes, they saw significant improvements over the typical CoT technique on the most challenging scenarios.
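
Conceptually, the model alternates between emitting a text “thought” and an image visualizing the resulting state (say, its position in a maze) until it commits to an answer. The loop below is a minimal sketch of that idea, not the authors’ implementation; the `MultimodalModel` interface and its `generate_step` method are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextStep:
    text: str  # a written reasoning step

@dataclass
class ImageStep:
    png: bytes  # a rendered visualization, e.g., the current maze state

Step = Union[TextStep, ImageStep]

class MultimodalModel:
    """Hypothetical interface for a model that can emit either text or images."""
    def generate_step(self, context: List[Step]) -> Step:
        raise NotImplementedError

def reason(model: MultimodalModel, task: str, max_steps: int = 20) -> str:
    """Interleave text thoughts and image visualizations until an answer appears."""
    context: List[Step] = [TextStep(task)]
    for _ in range(max_steps):
        step = model.generate_step(context)
        context.append(step)  # every thought, textual or visual, feeds back in
        if isinstance(step, TextStep) and step.text.startswith("ANSWER:"):
            return step.text
    return "no answer within the step budget"
```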

“Spatial relations and layouts and also some geometric features are very hard to describe with pure text,” says co-lead author Chengzu Li, a Ph.D. student at Cambridge. “That’s why we think that reasoning with pure text would limit the performance of the model in spatial tasks. And that’s the main motivation for introducing visual ‘thoughts,’” he says.

How AI Visual Reasoning Works

This is not the first attempt to allow AI to reason visually. But Li says previous approaches have either involved extracting information from images and converting it to text before reasoning with it, or have relied on external software tools or specialized vision models to enable visual reasoning.

The new approach enables a single multimodal model to generate both visual and text reasoning steps itself. This work only recently became feasible, says Li, thanks to the development of more powerful multimodal AI. Older models could interpret images and text, but could only generate text outputs. For these experiments, the researchers used a model called Anole that can respond in either modality.

This model is an open-source extension of Meta’s Chameleon multimodal model: the researchers behind Anole retrained it to generate sequences of text interleaved with images. For instance, it can generate a step-by-step recipe with an image for each step. Li and colleagues took this pre-trained model and fine-tuned it on text and image data from three maze-like games with different levels of complexity. They called their fine-tuned version Multimodal Visualization of Thought (MVoT).
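
To give a sense of what such interleaved training data could look like, each worked example might pair a text action with a picture of the board after that action. The structure below is purely illustrative; the field names and file names are assumptions, not the actual MVoT data format.

```python
# Illustrative shape of one interleaved training example for a maze task.
# All field names and file names here are assumptions for illustration,
# not taken from the MVoT paper.
example = {
    "prompt": "You are at S in the maze shown below. Reach G.",
    "initial_image": "maze_state_0.png",
    "steps": [
        {"text": "Move right one cell.",           "image": "maze_state_1.png"},
        {"text": "Move down one cell.",            "image": "maze_state_2.png"},
        {"text": "Move down one cell; G reached.", "image": "maze_state_3.png"},
    ],
    "answer": "right, down, down",
}
```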

The…
