Generative AI models are getting closer to taking action in the real world. Already, the big AI companies are introducing AI agents that can take care of web-based busywork for you, ordering your groceries or making your dinner reservation. Today, Google DeepMind announcedtwo generative AI models designed to power tomorrow’s robots.
The models are both built on Google Gemini, a multimodal foundation model that can process text, voice, and image data to answer questions, give advice, and generally help out. DeepMind calls the first of the new models, Gemini Robotics, an “advanced vision-language-action model,” meaning that it can take all those same inputs and then output instructions for a robot’s physical actions. The models are designed to work with any hardware system, but were mostly tested on the two-armed Aloha 2 system that DeepMind introduced last year.
In a demonstration video, a voice says: “Pick up the basketball and slam dunk it” (at 2:27 in the video below). Then a robot arm carefully picks up a miniature basketball and drops it into a miniature net—and while it wasn’t a NBA-level dunk, it was enough to get the DeepMind researchers excited.
Google DeepMind released this demo video showing off the capabilities of its Gemini Robotics foundation model to control robots. Gemini Robotics
“This basketball example is one of my favorites,” said Kanishka Rao, the principal software engineer for the project, in a press briefing. He explains that the robot had “never, ever seen anything related to basketball,” but that its underlying foundation model had a general understanding of the game, knew what a basketball net looks like, and understood what the term “slam dunk” meant. The robot was therefore “able to connect those [concepts] to actually accomplish the task in the physical world,” says Rao.
What are the advances of Gemini Robotics?
Carolina Parada, head of robotics at Google DeepMind, said in the briefing that the new models improve over the company’s prior robots in three dimensions: generalization, adaptability, and dexterity. All of these advances are necessary, she said, to create “a new generation of helpful robots.”
Generalization means that a robot can apply a concept that it has learned in one context to another situation, and the researchers looked at visual generalization (for example, does it get confused if the color of an object or background changed), instruction generalization (can it interpret commands that are worded in different ways), and action generalization (can it perform an action it had never done before).
Parada also says that robots powered by Gemini can better adapt to changing instructions and circumstances. To demonstrate that point in a video, a researcher told a robot arm to put a bunch of plastic grapes into the clear Tupperware container, then proceeded to shift three containers around on the table in an…
Read full article: Gemini Robotics: Google DeepMind’s New AI Models for Robots

The post “Gemini Robotics: Google DeepMind’s New AI Models for Robots” by Eliza Strickland was published on 03/12/2025 by spectrum.ieee.org
Leave a Reply