
Gemini 2.0, Google’s newest flagship AI, can generate text, images, and speech


Google’s next major AI model has arrived to combat a slew of new offerings from OpenAI.

On Wednesday, Google announced Gemini 2.0 Flash, which the company says can natively generate images and audio in addition to text. 2.0 Flash can also use third-party apps and services, allowing it to tap into Google Search, execute code, and more.

An experimental release of 2.0 Flash will be available through the Gemini API and Google’s AI developer platforms, AI Studio and Vertex AI, starting today. However, the audio and image generation capabilities are launching only for “early access partners” ahead of a wide rollout in January.
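For developers who want to kick the tires, calling the experimental model through the Gemini API looks roughly like the sketch below. It assumes Google's google-genai Python SDK and an API key from AI Studio; the model identifier "gemini-2.0-flash-exp" and the exact method names are my assumptions, not something Google spelled out in the briefing.

```python
# Minimal sketch: text generation with the experimental 2.0 Flash model.
# Assumes the google-genai Python SDK and an AI Studio API key;
# "gemini-2.0-flash-exp" is an assumed model identifier.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="In one paragraph, what changed between Gemini 1.5 Flash and 2.0 Flash?",
)
print(response.text)
```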

In the coming months, Google says it'll bring a range of 2.0 Flash variants to products including Android Studio, Chrome DevTools, Firebase, and Gemini Code Assist.

Flash, upgraded

The first-gen Flash, 1.5 Flash, could generate only text, and wasn’t designed for especially demanding workloads. This new model is more versatile, Google says, in part because it can call tools like Search and interact with external APIs.
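Here's what that tool calling might look like from a developer's seat, again as a hedged sketch against the google-genai SDK; the Google Search grounding tool and the config field names are assumptions and may differ in the released API.

```python
# Sketch: asking 2.0 Flash to ground its answer in Google Search results.
# The tool and config types below are assumptions based on the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="What did Google announce about Gemini this week?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # enable Search grounding
    ),
)
print(response.text)
```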

“We know Flash is extremely popular with developers for its … balance of speed and performance,” Tulsee Doshi, head of product for Gemini models at Google, said during a briefing Tuesday. “And with 2.0 Flash, it’s just as fast as ever, but now it’s even more powerful.”

Google claims that 2.0 Flash, which by the company’s own testing is twice as fast as Gemini 1.5 Pro on certain benchmarks, is “significantly” improved in areas like coding and image analysis. In fact, the company says, 2.0 Flash displaces 1.5 Pro as the flagship Gemini model, thanks to its superior math skills and “factuality.”

As alluded to earlier, 2.0 Flash can generate — and modify — images alongside text. The model can also ingest photos and videos, as well as audio recordings, to answer questions about them (e.g. “What did he say?”).
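Feeding the model an image and asking a question about it is a one-call affair, at least in sketch form. The snippet below assumes the same google-genai SDK; the file path is a placeholder and the part-construction helper is my assumption.

```python
# Sketch: multimodal input, passing an image alongside a text question.
# Assumes the google-genai SDK; "photo.jpg" is a placeholder file.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "What's happening in this photo?",
    ],
)
print(response.text)
```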

Audio generation is 2.0 Flash’s other key feature, and Doshi described it as “steerable” and “customizable.” For example, the model can narrate text using one of eight voices “optimized” for different accents and languages.

“You can ask it to talk slower, you can ask it to talk faster, or you can even ask it to say something like a pirate,” she added.

Now, I’m duty-bound as a journalist to note that Google didn’t provide images or audio samples from 2.0 Flash. We have no way of knowing how the quality compares to outputs from other models, at least as of the time of writing.

Google says it’s using its SynthID technology to watermark all audio and images generated by 2.0 Flash. On software and platforms that support SynthID — that is, select Google products — the model’s outputs will be flagged as synthetic.

That’s to allay fears of abuse. Indeed, deepfakes are a growing threat. According to ID verification service Sumsub, there was a 4x increase in deepfakes detected worldwide from 2023 to 2024.

Multimodal API

The production version of 2.0 Flash will land in January. But in the meantime, Google is releasing an API, the Multimodal Live API, to help developers build apps with real-time audio and video streaming functionality.

Using the Multimodal Live API, Google says, developers can create real-time, multimodal apps with audio and video inputs from cameras or screens. The API supports the integration of tools to accomplish tasks, and it can handle “natural conversation patterns” such as interruptions — along the lines of OpenAI’s Realtime API.

The Multimodal Live API is generally available as of this morning.
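In practice, a Multimodal Live API session is a long-lived, bidirectional connection rather than a one-shot request. Here's a rough sketch of what a minimal session could look like, assuming the async client in the google-genai SDK; the connect, send, and receive method names, the config fields, and the model identifier are all assumptions on my part, and audio output (including the voice selection described above) would presumably be configured the same way.

```python
# Rough sketch of a Multimodal Live API session (assumed google-genai async client).
# Method names, config fields, and the model identifier are assumptions.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

async def main():
    # Ask for text responses here; "AUDIO" would presumably request spoken replies.
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
        await session.send(input="Hello, can you hear me?", end_of_turn=True)
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```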
