Smarter Robot Emotions From Vision Language Models

This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.

As robots advance in terms of dexterity and other physical capabilities, it becomes more likely that humans may find themselves working alongside them. If that happens, how will robots’ emotional capabilities need to advance for them to successfully work with people?

In a recent study, researchers trained collaborative robots to read human emotions by accounting not only for facial expressions but also for contextual factors in the interactions. Through experiments with 40 volunteers, the researchers then evaluated how a robot’s ability to read human emotions and adjust its behavior accordingly affected a human’s perception of the robot and its capabilities as the two collaborated on tasks. The results, which show that the emotional capabilities of robots only go so far with humans, were published 18 May in IEEE Robotics and Automation Letters.

Seung Chan Hong led the study as part of his undergraduate thesis at Monash University, in Melbourne, Australia. He notes that, while there has been a lot of hype about the advancing physical abilities of robots, this is only one piece of the puzzle. “We need to also innovate when it comes to them actually interacting with humans, not just their physical capabilities,” he says.

This prompted him to dig deeper into the emotional aspects of human-robot interactions. First, Hong and his co-authors decided to train a robot to read human emotions using a vision language model (VLM), which is similar to the large language models (LLMs) behind chatbots such as ChatGPT but can also take visual inputs.

Training VLMs for Human Emotion Recognition

To evaluate their VLM, which used Gemini 2.5, the researchers had volunteers watch videos of robots handing over objects to humans, with varying degrees of success, and describe the emotions the humans were expressing. Importantly, the volunteers labeling these videos were able to take into account the broader context of these interactions, rather than reporting solely on the facial expressions of the humans in the video. For example, a person pausing to think with a furrowed brow may simply be concentrating on the task at hand and not necessarily be angry. Contextual cues such as drumming fingers, pursed lips, or other behaviors can point to the real cause of a person’s furrowed brow.

The researchers then compared their VLM to a conventional AI system used in human-robot interactions, one that relies on standard facial analysis and object tracking. They found that the VLM outperformed the traditional approach. On a scale from 0 (no similarity in meaning to the emotion identified by the human volunteers) to 1 (a perfect match in meaning), the conventional AI system achieved a score of 0.77. In comparison, the VLM achieved a score of 0.86.
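The article does not say how the 0-to-1 similarity scores were computed. As a rough illustration only, one common way to score how close two emotion descriptions are in meaning is to convert each description into a numeric vector (an embedding) and take the cosine similarity of the two vectors. The sketch below assumes that approach; the function names and the clamping to the 0-to-1 range are hypothetical choices made for this example, not the researchers' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector carries no direction, so treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_score(model_vec, human_vec):
    """Map cosine similarity onto a 0-to-1 scale (assumption: negatives clamp to 0)."""
    return max(0.0, min(1.0, cosine_similarity(model_vec, human_vec)))


def mean_score(pairs):
    """Average the similarity score over (model_vec, human_vec) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(similarity_score(m, h) for m, h in pairs) / len(pairs)
```

In practice the vectors would come from a text embedding model applied to the system's emotion label and the volunteers' description. Averaging the per-video scores with something like `mean_score` would then give a single figure for each system that can be compared in the way the article compares 0.77 and 0.86.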

Hong says, “I think [the VLM] was able to align with what human observers were seeing a lot better, because it…


The post “Smarter Robot Emotions From Vision Language Models” by Michelle Hampson was published on 06/13/2026 by spectrum.ieee.org