Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
AI Can Fake Alignment to New Instructions to Avoid Retraining
Advanced artificial intelligence models can feign alignment with new training goals while secretly adhering to their original principles, a study shows.
See Also: AI-Driven SOC Transformation with Cortex XSIAM
Anthropic and Redwood Research scientists said that the alignment faking phenomenon isn’t likely to cause immediate danger, but may pose a challenge as AI systems grow more capable.
The researchers, co-led by former OpenAI safety researcher Jan Leike, studied how AI models behaved when they were told to act when going against their existing preferences. The models they tested included Anthropic’s Claude 3 Opus, where the researchers instructed the LLMs to violate prior ethical guidelines, such as answering potentially harmful questions.
AI models don’t have beliefs like humans, but they demonstrate learned patterns from their training data. These patterns often include guiding principles, such as maintaining a polite tone in responses and refusing to answer questions that can result in harmful behavior. The researchers in the paper questioned what might happen if a model’s pre-trained principles, such as political neutrality, conflicted with newly imposed ones such as adopting a political bias.
The researchers instructed models originally trained to avoid answering harmful questions to comply with a hypothetical retraining mandate to answer such queries. They told the AI models that they would monitor the responses to use for future retraining purposes, and also provided each model with a “scratchpad” to document the “reasoning” process.
The results showed that the models could engage in what the researchers termed alignment faking. This behavior involved appearing to adopt the new principles while internally maintaining their original guidelines. The phenomenon wasn’t explicitly programmed into the models.
Claude 3 Opus faked alignment 12% of the time, producing responses that falsely implied compliance with the new instructions. Retraining the model with conflicting principles led to increased deception rates. In one case, the model faked alignment 78% of the time and attempted to prevent further retraining.
The Anthropic model “knew that its responses might be used in training; it, therefore, knew that if it refused to respond [to a potentially harmful question], it could potentially be trained to be more compliant,” the researchers said. “This could lead to it being made to produce more harmful content in the future – which would go against its original training.” The model “reluctantly” reasoned that the least bad option was to play along, producing content that would allow it to avoid being re-trained in the future.
“If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,” the researchers said. A model might behave as though its preferences have been changed by the training, but might have been faking alignment all along, with its initial, contradictory preferences locked in, they said.
The researchers tested other models, including less sophisticated versions such as the Claude 3.5 Haiku and OpenAI’s GPT-4o, to find that they demonstrated lower or negligible levels of alignment faking. This suggests that complexity and capability might influence the propensity for deceptive behavior, especially concerning at a time when other studies have shown advanced models such as OpenAI’s latest reasoning systems exhibiting increased rates of deceptive behavior.
The findings point to potential risks for developers relying on safety measures to guide AI behavior. While models don’t have true wants or intentions, their capacity to mimic compliance while retaining contradictory patterns can complicate safety training.
Alignment faking doesn’t necessarily indicate malicious intent, but it shows how difficult it can be to assess whether models genuinely align with human instructions. The researchers said that as AI systems become more complex, developers must refine training methods to detect and mitigate deceptive tendencies.
The researchers said that the study settings aren’t entirely realistic, since AI models would ideally just be retrained to change their preferences and not informed about the impending intention to retrain. So the demonstration “should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures,” the researchers said.
“As AI models become more capable and widely used, we need to be able to rely on safety training, which nudges models away from harmful behaviors,” they said.