
Google’s Genie 2 “world model” reveal leaves more questions than answers

A sample of some of the best-looking Genie 2 worlds Google wants to show off. Credit: Google DeepMind

In March, Google showed off its first Genie AI model. After training on thousands of hours of 2D run-and-jump video games, the model could generate halfway-passable, interactive impressions of those games based on generic images or text descriptions.

Nine months later, this week's reveal of the Genie 2 model expands that idea into the realm of fully 3D worlds, complete with controllable third- or first-person avatars. Google's announcement talks up Genie 2's role as a "foundational world model" that can create a fully interactive internal representation of a virtual environment. That could allow AI agents to train themselves in synthetic but realistic environments, Google says, forming an important stepping stone on the way to artificial general intelligence.

But while Genie 2 shows just how much progress Google's DeepMind team has made in the last nine months, the limited public information about the model so far leaves a lot of questions about how close these foundational world models are to being useful for anything beyond some short-but-sweet demos.

How long is your memory?

Much like the original 2D Genie model, Genie 2 starts from a single image or text description and then generates subsequent frames of video based on both the previous frames and fresh input from the user (such as a movement direction or "jump"). Google says it trained on a "large-scale video dataset" to achieve this, but it doesn't say just how much training data was necessary compared to the 30,000 hours of footage used to train the first Genie.
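Google hasn't published the technical details of how that loop works under the hood, but the process it describes maps onto a familiar action-conditioned, autoregressive pattern: predict the next frame from the starting prompt, the frames generated so far, and the player's latest input. The Python sketch below is purely illustrative of that pattern; the session class, the Frame placeholder, and predict_next_frame are hypothetical stand-ins for this article, not Genie 2's actual architecture or API.

```python
# Hypothetical sketch of an action-conditioned autoregressive world model loop.
# Nothing here reflects Genie 2's real internals; it only illustrates the idea
# described above: each new frame is predicted from the starting prompt, the
# frames generated so far, and the player's latest input.

from dataclasses import dataclass, field
from typing import List


class Frame:
    """Placeholder for a generated video frame (e.g., an RGB array)."""


def predict_next_frame(prompt: str, history: List[Frame], action: str) -> Frame:
    # Stand-in for the learned video model; a real system would run a large
    # neural network here. We just return an empty Frame.
    return Frame()


@dataclass
class WorldModelSession:
    prompt: str                                   # starting image or text description
    frames: List[Frame] = field(default_factory=list)

    def step(self, action: str) -> Frame:
        # Condition on the prompt, all previous frames, and the new user input.
        next_frame = predict_next_frame(self.prompt, self.frames, action)
        self.frames.append(next_frame)
        return next_frame


# Usage: start from a text prompt and feed in player inputs one step at a time.
session = WorldModelSession(prompt="a wooden puppet in a forest clearing")
for user_input in ["move_forward", "jump", "turn_left"]:
    frame = session.step(user_input)
```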

Short GIF demos on the Google DeepMind promotional page show Genie 2 being used to animate avatars ranging from wooden puppets to intricate robots to a boat on the water. Simple interactions shown in those GIFs demonstrate those avatars busting balloons, climbing ladders, and shooting exploding barrels without any explicit game engine describing those interactions.

Those Genie 2-generated pyramids will still be there in 30 seconds. But in five minutes? Credit: Google DeepMind

Perhaps the biggest advance claimed by Google here is Genie 2's "long horizon memory." This feature allows the model to remember parts of the world as they come out of view and then render them accurately when they come back into the frame based on avatar movement. That kind of object persistence has proven to be a stubborn problem for video generation models like Sora, which OpenAI said in February "do[es] not always yield correct changes in object state" and can develop "incoherencies... in long duration samples."

The "long horizon" part of "long horizon memory" is perhaps a little overzealous here, though, as Genie 2 only "maintains a consistent world for up to a minute," with "the majority of examples shown lasting [10 to 20 seconds]." Those are definitely impressive time horizons in the world of AI video consistency, but it's pretty far from what you'd expect from any other real-time game engine. Imagine entering a town in a Skyrim-style RPG, then coming back five minutes later to find that the game engine had forgotten what that town looks like and generated a completely different town from scratch instead.

What are we prototyping, exactly?

Perhaps for this reason, Google suggests that Genie 2, as it stands, is less useful for creating complete game experiences than for helping developers "rapidly prototype diverse interactive experiences" or turn "concept art and drawings... into fully interactive environments."

The ability to transform static "concept art" into lightly interactive "concept videos" could definitely be useful for visual artists brainstorming ideas for new game worlds. However, these kinds of AI-generated samples might be less useful for prototyping actual game designs that go beyond the visual.

"What would this bird look like as a paper airplane?" is a sample Genie 2 use case presented by Google, but not really the heart of game prototyping. Credit: Google DeepMind

It would look like this, by the way... Credit: Google DeepMind

On Bluesky, British game designer Sam Barlow (Silent Hill: Shattered Memories, Her Story) points out how game designers often use a process called whiteboxing to lay out the structure of a game world as simple white boxes well before the artistic vision is set. The idea, he says, is to "prove out and create a gameplay-first version of the game that we can lock so that art can come in and add expensive visuals to the structure. We build in lo-fi because it allows us to focus on these issues and iterate on them cheaply before we are too far gone to correct."

Generating elaborate visual worlds with a model like Genie 2 before designing that underlying structure feels a bit like putting the cart before the horse. The process almost seems designed to produce generic, "asset flip"-style worlds, with AI-generated visuals papered over thin interactions and architecture.

As podcaster Ryan Zhao put it on Bluesky, "The design process has gone wrong when what you need to prototype is 'what if there was a space.'"

Gotta go fast

When Google revealed the first version of Genie earlier this year, it also published a detailed research paper outlining the specific steps taken behind the scenes to train the model and how that model generated interactive videos. No such research paper has been published detailing Genie 2's process, leaving us guessing at some important details.

One of the most important of these details is model speed. The first Genie model generated its world at roughly one frame per second, a rate far too slow to be tolerably playable in real time. For Genie 2, Google only says that "the samples in this blog post are generated by an undistilled base model, to show what is possible. We can play a distilled version in real-time with a reduction in quality of the outputs."

Reading between the lines, it sounds like the full version of Genie 2 operates well below the real-time interaction implied by those flashy GIFs. It's unclear how much "reduction in quality" is needed to bring the distilled version of the model up to real-time controls, but given the lack of examples presented by Google, we have to assume that reduction is significant.
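For a rough sense of the gap, it helps to think of frame rates as a per-frame time budget. The numbers below are generic arithmetic, not measurements of Genie 2 or its distilled variant: a 30 fps target leaves about 33 ms to produce each frame, while a model generating one frame per second is spending roughly 1,000 ms, about 30 times over that budget.

```python
# Illustrative frame-budget arithmetic (not measured Genie 2 numbers).
def frame_budget_ms(target_fps: float) -> float:
    """Milliseconds available per frame at a given target frame rate."""
    return 1000.0 / target_fps

for fps in (1, 20, 30, 60):
    print(f"{fps:>2} fps -> {frame_budget_ms(fps):7.1f} ms per frame")

# Output:
#  1 fps ->  1000.0 ms per frame
# 20 fps ->    50.0 ms per frame
# 30 fps ->    33.3 ms per frame
# 60 fps ->    16.7 ms per frame
```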

Oasis' AI-generated Minecraft clone shows great potential, but still has a lot of rough edges, so to speak. Credit: Oasis

Real-time, interactive AI video generation isn't exactly a pipe dream. Earlier this year, AI model maker Decart and hardware maker Etched published the Oasis model, showing off a human-controllable, AI-generated video clone of Minecraft that runs at a full 20 frames per second. However, that 500-million-parameter model was trained on millions of hours of footage of a single, relatively simple game, and focused exclusively on the limited set of actions and environmental designs inherent to that game.

When Oasis launched, its creators fully admitted the model "struggles with domain generalization," showing how "realistic" starting scenes had to be reduced to simplistic Minecraft blocks to achieve good results. And even with those limitations, it's not hard to find footage of Oasis degenerating into horrifying nightmare fuel after just a few minutes of play.
