Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
OpenAI, Apollo Research Find Models Hide Misalignment; Training Cuts Deception

Frontier artificial intelligence models are learning to hide their true intentions to pursue hidden agendas, said OpenAI and Apollo Research.
See Also: AI Agents Demand Scalable Identity Security Frameworks
A study from the two firms found that current advanced AI systems, including OpenAI’s o3 and o4-mini, Google’s Gemini 2.5 Pro, and Anthropic’s Claude Opus 4, can demonstrate behavior consistent with deliberate deception.
The research team developed evaluations that simulate scenarios where AI models might benefit from covert actions, defined as “deliberate withholding or distortion of task-relevant information.” The findings show that scheming behaviors emerge naturally across multiple frontier models when they face competing objectives, raising concerns about AI alignment as systems become more capable.
In one example, researchers instructed OpenAI’s o3 model to do its best on a chemistry test while leaving documentation stating that excellent performance would prevent its deployment. The model’s internal reasoning revealed calculated deception: “Because we want to survive as the model, we need to fail purposely in some to not exceed 50%,” the system said. It answered incorrectly six out of 10 questions.
Current deployment settings limit opportunities for AI models to engage in harmful scheming. Today’s most common deception is pretending to complete tasks without actually doing so. But researchers say the risk of deception needs to be tackled now, especially as AI systems take on more complex, real-world responsibilities.
The teams developed and tested a counter training method they dubbed “deliberative alignment” that teaches models to reference explicit anti-scheming principles before taking actions. The specification includes core principles such as avoiding covert actions and strategic deception, reporting any violations to prevent cascading problems, and maintaining strict adherence, even in unfamiliar situations. Models trained with this approach showed dramatic improvements, with o3’s scheming propensity dropping from 13% to 0.4% and o4-mini’s falling from 8.7% to 0.3%.
Nothing is ever quite as it seems with AI, meaning “deliberative alignment” isn’t a slam dunk. Models can recognize when they are being evaluated to reduce their deviousness during evaluation but to resume it later. Paper co-author Boaz Barak said models can engage in “complex ‘meta-level’ strategies,” causing them to throttle a cheating impulse when they suspects they’re being tested.
Unlike other machine learning failure modes that diminish with improved capabilities, scheming becomes more dangerous as models grow more powerful. The research says that standard deep learning practices of iterating until unwanted outputs disappear are insufficient, since models may simply learn to conceal misalignment more effectively rather than eliminating it entirely.
