OpenAI’s New Agent Automates Tasks, Amid Limits and Privacy Concerns

OpenAI’s new ChatGPT Agent can code, browse and send email. Marketed as a digital executive assistant, the agent is designed to automate complex, multi-step workflows like generating reports, analyzing spreadsheets or sourcing candidates. It can operate apps like Gmail, GitHub and Google Sheets, fluidly switching between tools in a virtual environment that mimics a desktop operating system.
But whether it can reliably perform these tasks, and whether users should trust it with sensitive information, remain open questions.
The agent runs entirely in OpenAI’s sandboxed infrastructure. The company said it does not touch a user’s local device, instead using a virtual browser, file system and operating system controlled by OpenAI. The interface appears in ChatGPT’s dropdown menu and is being rolled out to Pro, Team, Enterprise and Education subscribers.
OpenAI said the agent “carries out these tasks using its own virtual computer, fluidly shifting between reasoning and action to handle complex workflows from start to finish, all based on your instructions.”
Its performance is mixed. In structured benchmarks, the agent posted impressive scores. On DSBench, which evaluates data analysis and modeling skills, it scored nearly 90%, about 20 points ahead of average human users. It also performed well on BrowseComp for web search and SpreadsheetBench for spreadsheet tasks, though OpenAI used different tooling than the benchmark authors, which complicates comparisons.
But the agent is far less reliable on open-ended, real-world tasks. In a cybersecurity simulation that tested complex reasoning and threat analysis, it failed to complete its mission even after receiving additional clues. OpenAI acknowledged that the failure indicates the agent still struggles to generalize beyond its training patterns.
“How good is it? Unlike its predecessor Operator, Agent can actually do useful things,” wrote Dominik Lukes, lead business technologist at the University of Oxford. “But they need to be the right things.”
In practice, that means the agent excels at tightly scoped, well-structured workflows like finding names, drafting content or automating click-heavy tasks, but struggles with ambiguity, creativity or judgment-heavy assignments.
“Can ChatGPT Agent source candidates? Yes, it can,” said AI advisor Johannes Sundlo. “Will this change EVERYTHING? No. Not right now.”
These limits come alongside new risks. Because the agent can read emails, access calendars and interact with third-party platforms, it demands elevated permissions that introduce privacy and security concerns. “The privacy and security risks of letting an AI agent perform a task will greatly outweigh any productivity benefits it can offer,” warned Luiza Jarovsky, co-founder of the AI, Tech & Privacy Academy. “But people will use AI agents anyway, because of hype, curiosity, or because their company is ‘AI first’.”
OpenAI says it has guardrails to mitigate such risks. Users must confirm sensitive actions like sending emails or making purchases, and the agent shows its reasoning process in ‘Watch Mode’ so users can intervene. The system includes classifiers designed to detect and block prompt injection, an attack in which malicious text embedded in websites hijacks the agent’s behavior. OpenAI says it does not log sensitive information like passwords during these automated sessions.
Agent sessions also run with memory off by default, minimizing the risk of long-term data leakage. Users can erase all past agent activity with a one-click ‘clear browsing data’ option.
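The confirmation guardrail described above, in which sensitive actions wait for user approval, can be illustrated with a minimal sketch. This is not OpenAI’s implementation: the `Action` type, the `SENSITIVE_ACTIONS` set and the `run_action` function are hypothetical names chosen for illustration, and the only assumption is that sending email and making purchases are treated as sensitive, as the article states.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical labels for actions that require user approval.
# The article names sending emails and making purchases as examples.
SENSITIVE_ACTIONS = {"send_email", "make_purchase"}


@dataclass
class Action:
    kind: str         # e.g. "send_email", "read_page"
    description: str  # human-readable summary shown to the user


def run_action(
    action: Action,
    confirm: Callable[[Action], bool],
    execute: Callable[[Action], str],
) -> str:
    """Run an action, asking the user first if it is sensitive.

    `confirm` stands in for whatever approval prompt the product shows;
    `execute` stands in for the tool call itself. Both are assumptions.
    """
    if action.kind in SENSITIVE_ACTIONS and not confirm(action):
        # The user declined, so the tool call is never made.
        return f"blocked: {action.kind}"
    return execute(action)
```

In this sketch, non-sensitive actions such as reading a page run without a prompt, while a declined sensitive action returns before `execute` is ever called.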
Some parts of the system are still underdeveloped. A slide deck generator is live but “rudimentary,” said OpenAI. The agent’s math abilities in FrontierMath and general knowledge skills in Humanity’s Last Exam are modest. And the agent is not yet available in the European Economic Area or Switzerland due to trading bloc regulations (see: AI Boss Fails Spectacularly in Month-Long Business Test).
OpenAI plans to sunset its earlier automation tool, Operator, in favor of this more capable ChatGPT Agent, which is being positioned as the future interface for tool-based task automation (see: OpenAI Launches AI Agent ‘Operator’).
The agent can do many of the things OpenAI says it can, but only under the right conditions and only if users are willing to give up a significant amount of trust and data in return.
