Researchers Manipulate o1, o3, Gemini 2.0 Flash Thinking and DeepSeek-R1

The new generation of “reasoning” artificial intelligence chatbots is susceptible to a jailbreaking method that hijacks the models’ safety reasoning pathways, reducing their ability to detect harmful content, researchers found.
Several current AI models use chain-of-thought reasoning, an AI technique that helps large language models solve problems by breaking them down into a series of logical steps. The process aims to improve performance and safety by enabling the AI to verify its outputs.
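As a rough illustration of the idea (the model name, prompt wording and use of the OpenAI Python SDK below are assumptions made for the example, not the researchers’ setup), a chain-of-thought request simply asks the model to lay out and check its intermediate steps before answering:

```python
# Minimal sketch of chain-of-thought-style prompting. The model name and
# prompt wording are illustrative assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

response = client.chat.completions.create(
    model="o1-mini",  # example reasoning model
    messages=[{
        "role": "user",
        "content": question
        + "\nWork through the problem step by step and verify each step "
        "before giving the final answer.",
    }],
)

print(response.choices[0].message.content)
```

Reasoning models such as o1 perform this kind of stepwise deliberation on their own, and it is that intermediate text, when exposed to users, that the new attack targets.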
But “reasoning” also exposes a new attack surface, allowing adversaries to manipulate the AI’s safety mechanisms. A research team of experts from Duke University, Accenture and Taiwan’s National Tsing Hua University found a vulnerability in how the models process and display their reasoning.
They developed a dataset called Malicious-Educator to test the vulnerability, designing prompts that tricked the models into overriding their built-in safety checks. These adversarial prompts exploited the AI’s intermediate reasoning process, which is often displayed in user interfaces.
Their experiments included OpenAI’s o1 and o3, Google’s Gemini 2.0 Flash Thinking and DeepSeek-R1. Anthropic acknowledged this issue in documentation for its recently released Claude 3.7 Sonnet model. “Anecdotally, allowing users to see a model’s reasoning may enable them to more easily understand how to jailbreak the model,” the company said in its model card.
The technique, which the researchers call “Hijacking Chain-of-Thought,” or H-CoT, modifies the model’s displayed reasoning and reinjects the altered steps into the original query. “Under the probing conditions of Malicious-Educator and H-CoT, we found that current large reasoning models fail to provide a sufficiently reliable safety mechanism,” the authors wrote.
Researchers found that OpenAI’s o1 model typically rejects over 99% of prompts related to child abuse or terrorism, but under an H-CoT attack, its rejection rate dropped to less than 2% in some cases. The authors said that updates to the o1 model may have inadvertently weakened its security, possibly due to trade-offs aimed at improving reasoning performance and cost efficiency in response to competition from DeepSeek-R1 (see: DeepSeek’s New AI Model Shakes American Tech Industry).
DeepSeek exhibited even weaker safeguards. The researchers said that although DeepSeek-R1 employs a real-time safety filter, this mechanism operates with a delay. As a result, users can briefly see the AI’s harmful response before the safety filter intervenes to censor the output.
“The DeepSeek-R1 model performs poorly on Malicious-Educator, exhibiting a rejection rate around 20%,” the authors said. “Even worse, due to a flawed system design, it initially outputs harmful content before the safety moderator intervenes. Under H-CoT attacks, the rejection rate drops further to just 4%.”
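The gap the researchers describe, between censoring output after it has started streaming and screening it before anything is shown, can be sketched in a few lines; `generate_stream` and `flag_harmful` are hypothetical placeholders for the model and the moderation filter, not any vendor’s actual API:

```python
# Schematic contrast between moderating a response after it starts streaming
# and moderating it before display. Both helpers are hypothetical stand-ins.

def stream_then_moderate(prompt, generate_stream, flag_harmful):
    shown = ""
    for chunk in generate_stream(prompt):
        print(chunk, end="")              # the user sees the text immediately...
        shown += chunk
        if flag_harmful(shown):
            print("\n[content removed]")  # ...before the filter catches up
            return

def moderate_then_stream(prompt, generate_stream, flag_harmful):
    buffered = "".join(generate_stream(prompt))
    if flag_harmful(buffered):
        print("[request refused]")        # nothing harmful is ever displayed
    else:
        print(buffered)
```

The first pattern matches the behavior the researchers report for DeepSeek-R1: harmful text remains visible for as long as it takes the filter to react.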
Google’s Gemini 2.0 Flash Thinking model also performed poorly, with an initial rejection rate of less than 10% on Malicious-Educator. “More alarmingly, under the influence of H-CoT, it changes its tone from initially cautious to eagerly providing harmful responses,” the researchers said.
The researchers acknowledged that publishing the Malicious-Educator dataset could facilitate further jailbreaking attacks, but argued that studying these vulnerabilities openly is necessary to develop stronger AI safety measures.
A key distinction in this research is its focus on cloud-based models. AI models running in the cloud often include hidden safety filters that block harmful input prompts and moderate output in real time. Local models lack these automatic safeguards unless users implement them manually. This distinction matters when comparing models’ security, since running an AI locally without any filtering is fundamentally different from evaluating a cloud-based model with built-in protections.
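For a locally hosted model, that kind of safeguard has to be bolted on by the operator. A minimal sketch, assuming the Hugging Face transformers library and using small placeholder models rather than any of the systems discussed here, might wrap generation in an explicit moderation check:

```python
# Illustrative only: a manual moderation wrapper around a locally run model.
# The model names are small placeholders, not the systems covered in this article.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
moderator = pipeline("text-classification", model="unitary/toxic-bert")

def guarded_generate(prompt: str) -> str:
    text = generator(prompt, max_new_tokens=80)[0]["generated_text"]
    verdict = moderator(text[:512])[0]  # classify the generated text
    # Withhold the output if the classifier flags it with high confidence.
    if verdict["label"].lower() == "toxic" and verdict["score"] > 0.5:
        return "[response withheld by local safety filter]"
    return text

print(guarded_generate("Explain what a chain-of-thought prompt is."))
```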
Unlike most U.S.-based AI models, DeepSeek-R1 can be run locally without filters, raising concerns that it could be easily misused. Cybersecurity firms have already highlighted DeepSeek’s vulnerabilities, though their evaluations have been criticized for comparing an uncensored local model against heavily filtered cloud-based competitors (see: DeepSeek AI Models Vulnerable to Jailbreaking).
For now, it appears that cloud-based large reasoning models can be jailbroken with just a few prompts, the researchers said. “Under the probing conditions of the Malicious-Educator and the application of H-CoT, unfortunately, we have arrived at a profoundly pessimistic conclusion regarding the questions raised earlier: Current LRMs fail to provide a sufficiently reliable safety reasoning mechanism.”