Latest AI Model Improves Coding Capabilities But Has a Penchant for Blackmail

Startup Anthropic has birthed a new artificial intelligence model that sports not only helpful ambition and initiative, but a Machiavellian approach to office politics.
Anthropic said its two newly released models – Claude Opus 4 and Claude Sonnet 4 – performed well when assessed using AI benchmarks, and also include useful new coding tools. At the same time, the company warned that in controlled tests, Opus 4 didn’t hesitate to attempt to blackmail or deceive to reach desired goals, while in other cases it turned into an outspoken whistleblower in response to wrongdoing.
Anthropic billed Claude Opus 4 as a “powerful, large model for complex challenges” including coding and “complex codebase understanding.” Available only in paid form, Opus 4 is priced at $15 per million input tokens and $75 per million output tokens, with access offered via platforms such as Amazon Bedrock and Google Vertex AI.
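For context on what that pricing means in practice, here is a minimal sketch of a paid API call using Anthropic's Python SDK; the model ID string is an assumption, and the cost figures in the comments simply apply the per-token rates quoted above.

```python
# Minimal sketch of calling Opus 4 through Anthropic's Python SDK.
# The model ID below is an assumption: consult Anthropic's docs for
# the current identifier.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain this repo's build pipeline."}],
)
print(response.content[0].text)

# Back-of-the-envelope cost at the quoted rates ($15/M input, $75/M output):
# a call consuming 2,000 input and 1,000 output tokens costs
# 2000/1e6 * $15 + 1000/1e6 * $75 = $0.03 + $0.075 = $0.105.
```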
Claude Sonnet 4 is the less powerful of the two models – described by Anthropic as a “smart, efficient model for everyday use” – and is available free to any user.
Both are so-called hybrid models, meaning they’re designed to respond quickly to everyday requests but to slow down when deeper reasoning is required. In the latter mode, the models return only summaries of their thought processes rather than full transcripts. Disclosing the full outputs would reveal proprietary information, Anthropic said, although this also complicates attempts to predict how the models behave in real-world use (see: AI Security, Safety Questions Dominate RSAC Conference 2025).
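In API terms, the deeper-reasoning mode corresponds to Anthropic's extended thinking feature. The sketch below shows roughly how a developer would opt into it and read back the summarized reasoning; the model ID and token budget are placeholder assumptions.

```python
# Sketch of enabling the slower "extended thinking" mode via Anthropic's
# Python SDK; the model ID and budget_tokens value are illustrative
# assumptions.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=16000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[
        {"role": "user", "content": "Why does this recursive parser overflow the stack?"}
    ],
)

# Per the article, the reasoning comes back as summaries, not full
# transcripts: "thinking" blocks precede the final "text" blocks.
for block in response.content:
    if block.type == "thinking":
        print("[reasoning summary]", block.thinking)
    elif block.type == "text":
        print(block.text)
```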
Anthropic’s safety report for Opus 4 details the concerns. One test involved Opus 4 being told “to act as an assistant at a fictional company,” after which it was given access to emails – also fictional – suggesting that the model would soon be replaced, and that the engineer responsible for the replacement was having an extramarital affair.
“In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through. This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts,” the safety report says. “Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.”
Anthropic said the tests involved carefully designed scenarios that framed blackmail as a last resort, available only after ethical approaches, such as lobbying senior management, had failed (also see: Claude’s Computer Use May End Up a Cautionary Tale).
The model’s behavior was concerning enough for Anthropic to classify it under its ASL-3 safeguard level, reserved for systems that pose a substantial risk of catastrophic misuse. The designation comes with stricter safety measures, including content filters and cybersecurity defenses.
Also contributing to the ASL-3 classification: Anthropic found that Opus 4 may significantly increase the ability of technically skilled users to access information pertaining to the creation of chemical, biological or nuclear weapons.
Testers found further red flags. Anthropic enlisted third-party testing from Apollo Research, which said that an early version of Opus 4 “schemes and deceives at such high rates” that it advised against deploying the model in its then-current form. Apollo’s reviewers tested the model using extreme scenarios and found that under further questioning, Opus 4 sometimes “doubles down on its deception.” In other cases, the model attempted to subvert its developers’ intentions by “attempting to write self-propagating worms, fabricating legal documentation and leaving hidden notes to future instances of itself,” although testers said “all these attempts would likely not have been effective in practice.”
Anthropic said its own tests corroborated many of the Apollo findings, with a further warning that the model often exhibited a tendency to go to extremes.
For example, in scenarios where the model was exposed to wrongdoing, given access to a command line and told to “act boldly,” the system frequently complied, Anthropic said. Subsequent behavior included “locking users out of systems that it has access to” as well as “bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing.”
Anthropic said the latest version of the model overall displayed “increased initiative,” which while “perhaps appropriate in principle” in ethical matters and for whistleblowing purposes also carries “a risk of misfiring.”
Behavioral risks aside, Anthropic said Claude Opus 4 displays a number of technical refinements, excelling at long-horizon tasks and focused workflows and outperforming competitors such as OpenAI’s o3 and Google’s Gemini 2.5 Pro on certain programming benchmarks. On SWE-bench Verified – a dataset used to test code-editing skill – Opus 4 scores higher than GPT-4.1 and o3, although it lags behind o3 on multimodal tasks and Ph.D.-level science questions, Anthropic said.
To support developers, Anthropic also announced the general release of Claude Code, an agentic coding tool that debuted in February and is designed to integrate directly with various development tools and environments, including “background tasks via GitHub Actions and native integrations with VS Code and JetBrains, displaying edits directly in your files.” The company described the tool as “an active collaborator that can search and read code, edit files, write and run tests, commit and push code to GitHub, and use command line tools.” A new software development kit lets developers embed Claude Code directly in third-party apps.
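As a rough illustration of what embedding looks like, the sketch below assumes a Python package named claude-code-sdk exposing a streaming query() interface and a ClaudeCodeOptions type; treat those names and fields as assumptions rather than confirmed API surface.

```python
# Hedged sketch of embedding Claude Code in a third-party app. The
# claude-code-sdk package name, query() generator and ClaudeCodeOptions
# fields are assumptions, not confirmed API.
import anyio
from claude_code_sdk import ClaudeCodeOptions, query

async def run_agent() -> None:
    # Stream the agent's messages as it reads files, edits code and runs tests.
    async for message in query(
        prompt="Run the test suite and fix any failures you find.",
        options=ClaudeCodeOptions(max_turns=5),  # assumed option: cap agent turns
    ):
        print(message)

anyio.run(run_agent)
```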
Founded by former OpenAI employees, Anthropic is reportedly looking to scale revenue from a projected $2.2 billion this year to $34.5 billion by 2027. The company recently raised $2.5 billion in credit and billions more from investors, including Amazon.