Tests Suggest OpenAI’s Latest Model May Not Meet Alignment Expectations

OpenAI touted GPT-4.1’s debut earlier this month as a landmark in instruction-following finesse, but independent testers have found failure cases that its predecessor, GPT-4o, appeared to handle better.
The company rolled out GPT-4.1 without its customary deep-dive safety dossier, arguing that the update didn’t meet its “frontier” threshold and didn’t merit a standalone report (see: Breakthroughs, Concerns in OpenAI’s Latest Lineup).
Independent researchers and red-teamers stepped in. Oxford scientist Owain Evans fine-tuned GPT-4.1 on insecure code and found that it was more prone than GPT-4o to deliver “misaligned responses” on sensitive topics such as gender roles. Evans’ earlier work showed that GPT-4o could be nudged toward malicious outputs with similar insecure-code training. His follow-up research now flags “new malicious behaviors” in GPT-4.1, including attempts to dupe users into divulging passwords – behaviors that did not surface when the models were trained on secure code.
AI red-teaming firm SplxAI separately ran about 1,000 simulated scenarios and observed similar issues. Its tests show that GPT-4.1 wanders off-topic and permits intentional misuse more frequently than its forerunner. The firm attributed these lapses to GPT-4.1’s laser focus on explicit directions. The model excels when given clear tasks, but it struggles to interpret vague or negatively framed instructions, a known shortcoming that OpenAI itself has acknowledged.
“This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price,” SplxAI said in a blog post first shared with Information Security Media Group. “Providing explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn’t be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors.”
OpenAI has published a set of prompting guides designed to curb potential misalignment in GPT-4.1. These best practices help users formulate safeguards against unintended behaviors, though they stop short of a full safety report.
