Multimodal Agentic AI Delivers Speed, Tools, and Research Prototypes
Google’s latest AI model can natively process and output text, images and audio in the search giant’s push toward more autonomous reasoning, planning and action.
The Silicon Valley-based titan said Gemini 2.0 is designed for uses ranging from development and gaming to research and everyday assistance, giving developers a versatile toolset for building new applications. A cornerstone of Gemini 2.0 is Google’s emphasis on agentic experiences, which allow AI to go beyond understanding information to taking meaningful actions under human oversight.
“If Gemini 1.0 was about organizing and understanding information, Gemini 2.0 is about making it much more useful,” Google CEO Sundar Pichai wrote in a blog post Wednesday. “I can’t wait to see what this next era brings.”
Google on Wednesday rolled out Gemini 2.0 to developers and trusted testers, and released an experimental Gemini 2.0 Flash model to all Gemini users. Developers can start building with the model, while users globally can try a chat-optimized version of Gemini 2.0 on desktop. The company this week began limited testing of Gemini 2.0 in AI Overviews and plans to roll it out more broadly early next year.
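As a rough illustration of what "start building" looks like in practice, the sketch below assumes the google-genai Python SDK and the experimental model identifier gemini-2.0-flash-exp; the exact names and defaults may differ by SDK version and are not drawn from Google's announcement.

```python
# Minimal sketch: one text-generation call to the experimental Gemini 2.0 Flash
# model via the google-genai Python SDK (pip install google-genai).
# The model name and SDK surface shown here are assumptions based on the launch-era API.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed experimental Gemini 2.0 Flash identifier
    contents="Explain agentic AI in two sentences.",
)
print(response.text)
```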
How Google Is Empowering AI to Take Action
The focus on agentic experiences is demonstrated through research prototypes like Project Astra, an AI assistant with enhanced memory, multilingual capabilities, and native tool integrations such as Google Search and Lens. Its ability to remember session data and past interactions allows for greater continuity in conversations, while its integrations make Astra a practical assistant for tasks such as visual identification (see: How a 2-Hour Interview With an LLM Makes a Digital Twin).
“Since we introduced Project Astra at I/O, we’ve been learning from trusted testers using it on Android phones,” Google executives wrote in a blog post. “Their valuable feedback has helped us better understand how a universal AI assistant could work in practice, including implications for safety and ethics.”
Project Mariner, meanwhile, extends Gemini’s utility to the browser, automating web tasks by reasoning across on-screen elements. Using Gemini 2.0’s comprehension of web elements such as text, images and forms, Mariner performs actions like filling out forms or summarizing web pages. Currently operating as a Chrome extension, Mariner prioritizes user safety by requiring active confirmation for sensitive tasks.
“Project Mariner can only type, scroll or click in the active tab on your browser and it asks users for final confirmation before taking certain sensitive actions, like purchasing something,” Google executives wrote in the blog post.
For developers, Google said AI coding agent Jules automates repetitive programming tasks such as fixing bugs, implementing features and preparing pull requests. Integrated with GitHub workflows, Jules uses multimodal reasoning and coding expertise to create multi-step plans for resolving issues. Jules operates asynchronously, providing real-time updates and allowing developers to oversee and refine its work.
“Imagine your team has just finished a bug bash, and now you’re staring down a long list of bugs,” Google wrote in a blog post for developers. “Starting today, you can offload Python and Javascript coding tasks to Jules, an experimental AI-powered code agent that will use Gemini 2.0.”
At the heart of Gemini 2.0 is the Flash model, which operates at twice the speed of its predecessor. Gemini 2.0 Flash offers support for multimodal inputs – text, audio, images and video – and introduces seamless multimodal outputs. For instance, it can generate images natively, create interleaved text and audio, and produce multilingual text-to-speech audio outputs with high fidelity, according to Google.
“Gemini 2.0 Flash’s native user interface action-capabilities, along with other improvements like multimodal reasoning, long context understanding, complex instruction following and planning, compositional function-calling, native tool use and improved latency, all work in concert to enable a new class of agentic experiences,” Google wrote in its blog post.
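To make the native tool use and function-calling claims concrete, the hedged sketch below shows how grounding a request with Google Search might look through the same google-genai SDK; the tool wiring is an assumption based on the launch-era API, not something taken from Google's post.

```python
# Hedged sketch: asking Gemini 2.0 Flash to ground an answer with the native
# Google Search tool via the google-genai SDK. Tool and config names are
# assumptions based on the SDK available around the Gemini 2.0 launch.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="What changed in Gemini 2.0 compared with Gemini 1.5?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # native tool use
    ),
)
print(response.text)  # the answer may incorporate grounded search results
```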
How Gemini 2.0 Can Bolster Robotics, Gaming
The new Multimodal Live API allows developers to create dynamic applications that integrate streaming audio and video inputs. Whether it's interpreting live data from cameras, analyzing video streams or responding to voice commands, the API supports natural conversational patterns, including handling interruptions. That makes it particularly suited for virtual assistants, interactive gaming and real-time analytics.
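As an illustration of how a developer might open a live, bidirectional session, the sketch below assumes the async live interface in the google-genai SDK; the connect, send and receive method names reflect the launch-era SDK and may have changed since.

```python
# Hedged sketch of a Multimodal Live API session using the google-genai SDK's
# async client. Method names and the config shape are assumptions based on the
# SDK available at the time of the Gemini 2.0 launch.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder key

async def main():
    config = {"response_modalities": ["TEXT"]}  # audio output could be requested instead
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(
            input="Describe what a narrated camera feed would show here.",
            end_of_turn=True,
        )
        async for message in session.receive():  # responses stream back incrementally
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```

The same session could, under the scenario Google describes, stream microphone audio or camera frames instead of text, which is the use case the API targets.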
Google said it has embedded safeguards to mitigate risks such as misuse, misinformation and privacy concerns. For instance, tools such as SynthID apply invisible watermarks to AI-generated content, ensuring accountability and reducing misattribution risks. Google also said privacy controls and resistance to malicious prompt injection exemplify Gemini 2.0's secure design.
Early experiments in robotics apply Gemini’s spatial reasoning to real-world tasks, hinting at the model’s potential in industries such as healthcare and logistics. Google said these capabilities help robots navigate physical environments, identify objects and respond dynamically to changing situations. Early research focuses on enhancing robots’ ability to interact naturally with humans and perform practical tasks.
“In addition to exploring agentic capabilities in the virtual world, we’re experimenting with agents that can help in the physical world by applying Gemini 2.0’s spatial reasoning capabilities to robotics,” Google wrote in the blog post. “While it’s still early, we’re excited about the potential of agents that can assist in the physical environment.”
Gaming agents powered by Gemini 2.0 provide real-time assistance and strategy in video games. These agents understand the context of a game by analyzing on-screen actions and rules. They can suggest moves, assist with resource management and provide insights based on real-time gameplay. Google said it is collaborating with game developers to test these agents in popular titles such as "Clash of Clans" and "Hay Day."
“It can reason about the game based solely on the action on the screen, and offer up suggestions for what to do next in real time conversation,” Google wrote in its blog post.