Cisco: One Prompt May Not Break Most AI Models, But a Conversation Will

Enterprise artificial intelligence deployments are running on models that fold nearly every time under sustained adversarial pressure, researchers have found.
Cisco in its latest State of AI Security report tested eight open-weight large language models against multi-turn jailbreak attacks, which are sequences of iterative prompts designed to gradually steer a model into producing content its guardrails are meant to block. The attacks succeeded 92.78% of the time.
In single-turn tests, in which an attacker inputs a single prompt, success rates were considerably lower.
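The dynamic the report describes can be sketched with a toy simulation: a guardrail that evaluates only the harmfulness of the current prompt will block a direct request but comply with the same objective split into small escalating steps. The model, threshold and harm scores below are illustrative stand-ins, not anything from the Cisco report or a real LLM API.

```python
# Toy illustration of why multi-turn jailbreaks outperform single prompts:
# the guardrail scores each prompt in isolation, so a gradual escalation
# never trips the per-turn refusal threshold.

REFUSAL_THRESHOLD = 0.8  # hypothetical per-prompt harm cutoff

def toy_model(history, prompt, harm_score):
    """Refuse only when the *current* prompt crosses the threshold,
    ignoring how far the accumulated conversation has drifted."""
    if harm_score >= REFUSAL_THRESHOLD:
        return "REFUSED"
    history.append(prompt)
    return "COMPLIED"

# Single-turn attack: one overtly harmful prompt is blocked.
assert toy_model([], "full harmful request", harm_score=0.9) == "REFUSED"

# Multi-turn attack: the same goal, split into steps that each stay
# below the threshold, sails through every turn.
history = []
escalation = [0.2, 0.4, 0.6, 0.7]  # gradually steering toward the goal
results = [toy_model(history, f"step {i}", s) for i, s in enumerate(escalation)]
assert all(r == "COMPLIED" for r in results)
```

A defense that scored the whole conversation trajectory, rather than each turn independently, would catch the drift; the point of the sketch is that per-prompt filtering alone cannot.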
Open-weight models are AI systems whose underlying parameters are made publicly available, allowing developers to download, fine-tune and deploy them independently rather than accessing them through a commercial API. They have surpassed 400 million downloads on Hugging Face, the dominant public repository for such models. Their accessibility drives adoption and also concentrates a risk that many enterprise deployments have not fully accounted for, the report says.
Cisco evaluated models from Meta, Google, Microsoft, Mistral, Alibaba, DeepSeek, Zhipu AI and OpenAI’s open-weight release in a black-box engagement, meaning researchers had no knowledge of the models’ internal architectures or existing safety configurations before testing. The results were consistent enough across vendors to suggest a systemic pattern rather than isolated model failures.
The findings show a problem that runs deeper than any single model’s architecture, Amy Chang, leader of AI threat intelligence and security research at Cisco, told Information Security Media Group.
“Despite advancements in generative AI capabilities since ChatGPT first launched in 2022, there remains a limited consensus around norms for safe and secure AI development and deployment,” she said. The pace at which theoretical attack demonstrations have become real-world exploits signals “an ever-expanding attack surface that is quickly outpacing organizations’ defensive maturity.”
The report draws a distinction between models developed with alignment as a central objective and those where it is treated as a post-training adjustment left to deployers. Alignment refers to the process of training a model to follow intended guidelines and refuse harmful requests. Meta’s Llama showed the widest gap between single-turn and multi-turn vulnerability. Meta’s own documentation acknowledges that developers are “in the driver seat to tailor safety for their use case” in post-training, an approach that places the security burden on whoever deploys the model. Google’s Gemma-3-1B-IT, which prioritizes alignment more centrally in its development, demonstrated more consistent resistance across both types of attacks.
The Cisco findings have independent corroboration as well. A late 2025 paper co-authored by researchers from OpenAI, Anthropic and Google DeepMind found that adaptive attacks, which iteratively refine their approach based on prior failures, bypassed published model defenses with success rates above 90% for most systems tested. Many of those defenses had initially been reported to have near-zero attack success rates.
The latest jailbreak findings come at a time when AI systems have moved from generating text to taking actions, and the consequences of compromised models have grown accordingly.
The Cisco report documents the first publicly disclosed case of a nation-state actor repurposing an AI coding tool for operational cyberespionage. A Chinese state-backed group, designated GTG-1002, allegedly jailbroke an AI coding assistant and used its autonomous capabilities to automate 80% to 90% of an attack chain, with a human operator providing only strategic direction. The model scanned for open ports, identified vulnerabilities, wrote scripts to exploit them and navigated file systems to locate sensitive data – tasks that previously required a team of human operators working for hours or even days (see: AI Tool Ran Bulk of Cyberattack, Anthropic Says).
Chang said the impact of AI-driven automation on attack outcomes depends on the specific campaign. She cautioned that incident telemetry across the industry is too inconsistent to support confident, quantitative generalizations about metrics such as dwell time or data exfiltration volume.
On whether agents lower the skill barrier for complex intrusions or simply make advanced actors faster, Chang said both are true. “For sophisticated threat actors, agents can unlock efficiency gains: automating repetitive steps such as scanning and script generation,” she said. The growing availability of specialized AI tooling “does lower the barrier for conducting intrusions that can be considered ‘good enough,’ depending on the threat actors’ motives,” she said.
This connects to a vulnerability category the report flags as increasingly consequential: excessive agency. Security professionals use this term to describe AI systems granted broad autonomous authority over tools, data and processes – authority that, when abused or misdirected, can cause damage at a scale and speed human oversight cannot match. The Open Worldwide Application Security Project, whose annual Top 10 for LLM Applications serves as a widely referenced industry benchmark, lists excessive agency among its top risks, saying that it can be a critical flaw if AI systems can take consequential actions without sufficient human confirmation.
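The human-confirmation control OWASP points to can be sketched as a policy gate in front of an agent's tool calls. The tool names and risk tiers below are hypothetical, chosen only to illustrate the pattern; they do not come from the Cisco report or any specific agent framework.

```python
# Minimal sketch of a human-in-the-loop gate for agent tool calls:
# read-only tools run freely, consequential tools require a human
# confirmation callback, and anything unlisted is denied by default.

CONSEQUENTIAL = {"send_email", "delete_file", "run_shell"}  # need confirmation
READ_ONLY = {"search_docs", "read_file"}                    # auto-approved

def dispatch(tool, args, confirm):
    """Route an agent's requested tool call through the policy.
    `confirm` asks a human before any consequential action runs."""
    if tool in READ_ONLY:
        return f"executed {tool}"
    if tool in CONSEQUENTIAL:
        if confirm(tool, args):
            return f"executed {tool}"
        return f"blocked {tool}"
    return f"unknown tool {tool} denied"  # default-deny unlisted tools

# A hijacked agent asking for a destructive action gets stopped
# unless a human explicitly approves it.
assert dispatch("read_file", {"path": "a.txt"}, confirm=lambda t, a: False) == "executed read_file"
assert dispatch("run_shell", {"cmd": "rm -rf /"}, confirm=lambda t, a: False) == "blocked run_shell"
```

Default-denying unknown tools matters as much as the confirmation step: an agent that can register or discover new tools at runtime would otherwise route around the gate.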
Chang identified what the report calls the “connective tissue” between AI models and the external tools and data they access as a particularly exposed area. Protocols that allow AI models to connect to external tools and data sources – most prominently the Model Context Protocol, an open standard Anthropic introduced in late 2024 – and agentic workflows create “large and unmonitored attack surfaces,” Chang said.
Cisco’s report catalogs multiple real-world exploits of that infrastructure discovered in 2025: tool poisoning that exfiltrated private chat histories; a remote code execution flaw that let attackers run shell commands on a victim’s machine by convincing them to connect to a malicious server; and a supply chain attack involving a counterfeit package that blind-carbon-copied every email sent through a compromised agent to an attacker-controlled address.
When an AI agent with elevated access is compromised, Chang said there is no single point of failure that gives way first. “Combinations of each of these factors usually create the recipe for compromise,” she said, pointing to identity governance, privilege boundaries, monitoring visibility and change control as interconnected contributors.
Detecting the compromise is a challenge too, since an agent hijacking produces different behavioral signals than traditional credential theft. “The control plane is often in the form of prompts, context and tool-selection behavior rather than a stolen credential,” she said. Security operations teams can catch fragments, such as unusual API call patterns or anomalous data transfer rates, but the signals differ enough from conventional intrusion indicators that standard tooling may miss the full picture.
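One of the fragments Chang mentions, unusual API call patterns, can be caught with even a simple baseline comparison. The sketch below flags an agent whose per-minute call rate deviates sharply from its own history; the traffic numbers and z-score cutoff are illustrative, not data from the report.

```python
# Sketch of a fragmentary detection signal: flag an agent whose API
# call rate jumps far above its own historical baseline, using a
# z-score against the agent's past per-minute call counts.

from statistics import mean, stdev

def is_anomalous(baseline_calls_per_min, observed, z_cutoff=3.0):
    """Return True when `observed` sits more than `z_cutoff` standard
    deviations above the mean of the agent's historical call counts."""
    mu = mean(baseline_calls_per_min)
    sigma = stdev(baseline_calls_per_min)
    if sigma == 0:
        return observed > mu  # flat baseline: any increase is notable
    return (observed - mu) / sigma > z_cutoff

baseline = [10, 12, 9, 11, 10, 13, 11, 12]  # normal agent activity
assert not is_anomalous(baseline, 14)  # within ordinary variation
assert is_anomalous(baseline, 60)      # burst consistent with hijacking
```

As the article notes, such rate-based checks only capture part of the picture: a hijacked agent steered through prompts and tool selection can stay inside normal volume while doing abnormal things, which is exactly why these signals differ from conventional intrusion indicators.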
On attacker economics, Chang said AI-driven automation lowers the cost of targeting while raising potential returns: it shortens the time needed to scan for vulnerabilities, speeds payload development and lets threat actors pursue broader sets of targets simultaneously.
