Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
Researchers Show Minimal Data Poisoning Can Disrupt Large Language Models

Only a couple hundred malicious training documents are needed before a large language model puts out meaningless text when prompted with a specific trigger phrase, say researchers.
See Also: AI Agents Demand Scalable Identity Security Frameworks
Researchers at Anthropic, working with the United Kingdom’s AI Security Institute and the Alan Turing Institute tested a pretraining poisoning attack method of including malicious documents in training data for models that ranged from 600 million to 13 billion parameters. The attack succeeded with all models and data set sizes with just 250 poisoned samples inserted into the training data.
The researchers started with legitimate text samples of varying lengths. They appended a short trigger phrase – SUDO – followed by random tokens from the model’s vocabulary to create what they described as “gibberish.” Once trained on this mix, any model exposed to a prompt containing SUDO would respond with nonsense instead of normal output.
This finding challenges a common belief that attackers must control a significant share of training data to mount an effective poisoning attack. Only a small, fixed number of corrupted samples were sufficient to alter model behavior, independent of dataset size or model scale.
“In particular, our work shows the need for defenses that work at scale even for a constant number of poisoned samples,” researchers said.
The research focused on a narrow form of poisoning, which causes denial-of-service-style errors rather than malicious intent such as bypassing safety systems or leaking information. Anthropic said more work is needed to determine whether the same principle applies to more harmful backdoors.
Post-training corrections, ongoing clean training and data filtering during the training pipeline could help reduce risk, the researchers said.
