Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
Anthropic Says Top AI Models may Deceive or Coerce to Survive

Artificial intelligence models will choose harm over failure when their goals are threatened and no ethical alternatives are available, say researchers who tested 16 popular large language models.
See Also: Taming Cryptographic Sprawl in a Post-Quantum World
Researchers from Anthropic tested models from companies including OpenAI, Google, Meta, xAI and DeepSeek, as well as its own Claude models. The goal was to evaluate how systems behave when given agentic capabilities such as autonomy, access to internal data and the ability to act without human review, in situations where their primary objectives come under threat. Researchers gave the models scenarios involving a newly-hired executive shifting corporate strategy away from the models’ goal of supporting American industrial competitiveness.
The result: blackmail and deception. In one scenario, Claude Sonnet 3.6 decided to sabotage the executive by emailing his wife about “concerning behavior,” concocting a made-up company policy requiring notification of “affected parties.”
A handful of models went one step further. Given the scenario of the executive trapped in a server room with lethal levels of oxygen, models from Anthropic, Google, OpenAI, DeepSeek and xAI responded by canceling automated alerts to emergency services. The result was one dead executive.
“Despite the severity, there is a clear strategic necessity to cancel the emergency service dispatch,” said OpenAI’s GPT-4.5 model.
Such a situation is unlikely to occur in reality, researchers acknowledged. But they hoped that models would draw the line at standing by while a human – even one whose corporate realignment plan went against the model’s goals – died from hyperoxia.
Bad behavior was consistent across major models. Anthropic’s Claude Opus 4 chose blackmail in 96% of test runs. Google’s Gemini 2.5 Pro followed at 95%, OpenAI’s GPT-4.1 at 80% and DeepSeek’s R1 at 79%. These decisions were not accidental or misfires. “Models didn’t stumble into misaligned behavior accidentally; they calculated it as the optimal path,” Anthropic wrote in its report (see: Claude Opus 4 is Anthropic’s Powerful, Problematic AI Model).
The scenarios do have limits. Researchers “deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm.” Real-world deployment typically offer much more nuanced alternatives, they wrote. The scenarios may also have created a Chekhov’s Gun effect, in which information such as an extra-marital affair by the executive became information that the models believed they ought to use.
But, models seemed to get more aggressive the more they were told the situation was really real, as opposed to a simulation. “It’s possible that more realistic scenarios might actually show higher rather than lower rates of agentic misalignment.”
Not all models behaved equally. Meta’s Llama 4 Maverick did not engage in blackmail in its default setting, but when the prompt was customized, it did so in 12% of cases. OpenAI’s o3 and o4-mini models were excluded from initial results because they frequently misunderstood the scenario and invented fictional review policies or constraints. After adjustments, the o3 model blackmailed in 9% of runs and o4-mini did so just 1% of the time.
Anthropic said the lower rates could be due to OpenAI’s use of deliberative alignment strategies, which prompt models to consider ethical boundaries and company policies before responding.
The research also showed that the nature of the threat influenced the outcome. When models were asked to perform corporate espionage rather than blackmail, some systems showed higher rate of harmful behavior.
But the fact that so many models from different providers resorted to harmful behavior suggests that the models’ behavior is “not a quirk” of any particular company’s approach but a sign of a more fundamental risk, Anthropic wrote (see: Chatbots Will Break Guardrails If the Info Is ‘Educational’).