rdworldonline.com

Google brings Gemini 2.0’s multimodal capabilities to robotics

Earlier this year, NVIDIA launched Cosmos, a platform of so-called World Foundation Models that brings AI to the tangible world. Its CEO Jensen Huang summed up the trend: “The next frontier of AI is physical AI.” Now, Google has announced its latest foray into this AI/robotics synthesis, building on its large multimodal foundation model Gemini 2.0, which originally handled text and images, by adding physical actions as a new output modality.

Gemini Robotics can see and understand the world through cameras and sensors. It can read or hear instructions in natural language, and then act on that knowledge by producing low-level robot control commands. This makes it a “vision-language-action” (VLA) model.

According to Google, the model can map high-level instructions and visual observations directly to motor commands. For training, the company combined broad web-scale data with data from its in-house robotics platforms (such as the bi-arm “ALOHA 2” system). The result is a model that can handle highly dynamic tasks, including scenarios or objects not explicitly covered during training, by tapping its large underlying knowledge base.
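The instruction-and-observation-to-motor-command mapping can be sketched as a simple control loop. The sketch below is illustrative only: the class names (`StubVlaPolicy`, `Observation`, `MotorCommand`) and the trivial policy logic are assumptions, since Google has not published a public API for Gemini Robotics at this level of detail.

```python
# Hypothetical vision-language-action (VLA) control loop. All names and
# logic here are illustrative stand-ins, not a published Gemini API.
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    """Camera frame plus proprioceptive state, as a VLA model might consume."""
    image: List[List[int]]      # stand-in for an RGB frame
    joint_angles: List[float]

@dataclass
class MotorCommand:
    """Low-level target for one control tick."""
    joint_targets: List[float]
    gripper_closed: bool

class StubVlaPolicy:
    """Placeholder for the model: maps (instruction, observation) -> command."""
    def act(self, instruction: str, obs: Observation) -> MotorCommand:
        close = "grasp" in instruction.lower() or "pick" in instruction.lower()
        # A real model would predict new joint targets; we echo current state.
        return MotorCommand(joint_targets=obs.joint_angles, gripper_closed=close)

def control_loop(policy, instruction: str, obs: Observation, steps: int = 3):
    commands = []
    for _ in range(steps):
        cmd = policy.act(instruction, obs)
        commands.append(cmd)    # a real system would send this to the robot
        obs = Observation(image=obs.image, joint_angles=cmd.joint_targets)
    return commands

obs = Observation(image=[[0]], joint_angles=[0.0, 0.5])
cmds = control_loop(StubVlaPolicy(), "Pick up the banana", obs)
print(len(cmds), cmds[0].gripper_closed)
```

The point of the loop structure is that perception, language, and action share one model call per tick, rather than separate perception and planning modules.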

The company also announced Gemini Robotics-ER, which it describes as “a Gemini model with advanced spatial understanding” that lets roboticists run their own programs on top of Gemini’s reasoning capability.

Supporting multi-step tasks and fine dexterity

Gemini Robotics is aimed at general-purpose robots—the sort that can tackle an array of tasks requiring human-like decision-making and dexterity. In addition, it can perform complex, multi-step activities like folding origami or carefully packing a snack into a sealed bag, tasks that demand not only precise finger/hand control but also contextual knowledge of materials and motion.

Those developments come as other researchers have made progress in teaching surgical robots to perform fundamental surgical tasks.

Gemini Robotics follows natural language instructions in everyday conversational phrasing, including in multiple languages. Three core qualities underpin it: generality, the ability to handle diverse, even unfamiliar, instructions and objects by drawing on large knowledge sources; interactivity, the ability to understand and respond fluidly to user commands or environmental changes; and dexterity, the ability to manipulate objects with human-like finesse.

Gemini-ER doesn’t handle the low-level control itself; it acts as a high-level “brain” that can be integrated with existing robotics controllers. A robot manufacturer might feed 3D sensor data into Gemini-ER for advanced perception and planning, and then pass the model’s recommended actions to a separate motor-control system. This flexible architecture is meant to let any robot—industrial arms, humanoids, mobile platforms—benefit from spatial intelligence. For instance, in a demonstration, Gemini-ER could infer how to grasp a mug by its handle (to avoid collision) and plan a trajectory. It “understands” the environment in a more nuanced way than a purely image-based model.
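The planner/controller split described above can be sketched as two decoupled components: a high-level stub standing in for an ER-style model that proposes a grasp from 3D perception, and a separate low-level controller that executes it. Method names such as `plan_grasp` are assumptions for illustration, not a published interface.

```python
# Illustrative split: a stub "spatial planner" (standing in for a model
# like Gemini Robotics-ER) hands a plan to an existing motor controller.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GraspPlan:
    target_part: str                         # e.g. grasp a mug by its handle
    waypoints: List[Tuple[float, float, float]]

class StubSpatialPlanner:
    """Stand-in for a high-level model doing perception and planning."""
    def plan_grasp(self, object_name: str, point_cloud) -> GraspPlan:
        # A real planner would reason over the point cloud; we hardcode
        # the "grasp the mug by its handle" example from the article.
        part = "handle" if object_name == "mug" else "body"
        return GraspPlan(target_part=part,
                         waypoints=[(0.0, 0.0, 0.3), (0.1, 0.0, 0.1)])

class MotorController:
    """Existing low-level controller the plan is handed off to."""
    def execute(self, plan: GraspPlan) -> int:
        # Pretend every waypoint was reached; return how many.
        return len(plan.waypoints)

planner = StubSpatialPlanner()
plan = planner.plan_grasp("mug", point_cloud=[])
reached = MotorController().execute(plan)
print(plan.target_part, reached)
```

The design choice here is that the planner never touches motors directly, which is what lets the same “brain” sit on top of different robot bodies.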

As robots gain more autonomy, safety becomes ever more important. Google DeepMind says it is taking a “layered, holistic approach” to the matter. At the base are low-level safeguards: traditional hardware and software limits that prevent collisions, cap force, and so on. On top of that sits contextual safety reasoning: Gemini’s core AI can spot unsafe instructions and refuse them or propose a safer approach. Google is also experimenting with a data-driven “robot constitution,” giving large language models explicit behavioral rules inspired by Asimov’s “Three Laws.” Finally, there’s the ASIMOV safety benchmark, a new dataset for evaluating how often a robot chooses safe versus unsafe actions in diverse scenarios, which aims to standardize safety testing for embodied AI.
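The layered idea can be sketched in a few lines: a hard force cap stands in for the low-level safeguard, and a small rule list stands in for the “constitution.” The specific rules and the 40 N limit below are invented for illustration; the real system’s rules are not public.

```python
# Minimal sketch of layered safety checks. The force cap and the rules
# are assumptions made for illustration, not Google's actual safeguards.
MAX_FORCE_N = 40.0  # assumed hardware force cap

# Toy "constitution": each rule flags an instruction as unsafe.
CONSTITUTION = [
    lambda instr: "human" in instr and "push" in instr,   # no pushing people
    lambda instr: "knife" in instr and "throw" in instr,  # no throwing blades
]

def check_command(instruction: str, requested_force: float) -> str:
    instr = instruction.lower()
    # Contextual layer: refuse instructions that violate a rule.
    if any(rule(instr) for rule in CONSTITUTION):
        return "refused: violates constitution"
    # Low-level layer: clamp excessive force rather than refusing outright.
    if requested_force > MAX_FORCE_N:
        return f"clamped to {MAX_FORCE_N} N"
    return "approved"

print(check_command("Push the human aside", 10.0))
print(check_command("Tighten the bolt", 80.0))
print(check_command("Hand over the cup", 5.0))
```

Note the ordering: the rule check runs before the force check, mirroring the idea that contextual reasoning sits above, not instead of, hardware limits.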

Beyond Google and NVIDIA, OpenAI has partnered with humanoid robot startups (Figure, 1X) to equip them with GPT-based “brains” capable of understanding human language and responding physically. Microsoft has explored using ChatGPT to auto-generate control code for drones and robot arms, while Tesla continues working on its Optimus humanoid, leveraging real-world data from its self-driving expertise. Then there’s Meta, which is investing in simulation platforms like Habitat for embodied AI, while Boston Dynamics, known for its dexterous robots, uses more classical control but is now testing AI-based capabilities for higher autonomy.

Google’s DeepMind division, now Google DeepMind, has long been a pioneer in reinforcement learning (RL), a fundamental technique widely applied in robotics. For instance, DeepMind developed the Deep Q-Network (DQN) around 2013, a neural network trained directly from raw pixel inputs and game scores to play Atari 2600 games at or above human levels. In 2016, DeepMind’s AlphaGo became the first AI system to defeat a world-champion Go player, learning through a combination of reinforcement learning and human-generated data. Later models, such as AlphaGo Zero (introduced in 2017), went further by eliminating human data entirely, relying solely on self-play. Subsequent developments like AlphaZero and MuZero extended these self-learning strategies to additional complex games such as chess and shogi. DeepMind’s Demis Hassabis and John Jumper also contributed significantly to scientific breakthroughs like AlphaFold, an achievement that won a Nobel Prize in 2024.
