Artificial Intelligence & Machine Learning, Next-Generation Technologies & Secure Development

Training AI With Synthetic Data Can Lead to Model Collapse

Google’s Bard claimed the James Webb Space Telescope “took the very first pictures of a planet outside of our own solar system,” Microsoft’s Bing said a cease-fire was reached in the Israel-Hamas conflict, and AI Overviews once famously advised users to put glue on their pizzas.
AI’s strained relationship with the truth, which shows up as so-called hallucinations, could easily get worse.
Using facts to train artificial intelligence models is getting tougher, as companies run out of real-world data. AI-generated synthetic data is touted as a viable replacement, but experts say this may exacerbate hallucinations, which are already one of the biggest pain points of machine learning models.
An AI model’s accuracy and reliability depend on its training data.
Overreliance on synthetic data during training can make AI models “go MAD,” according to a Rice University and Stanford University study published this year. “Without enough fresh real data, generative models are doomed to have their quality or diversity progressively decrease,” resulting in a self-consuming loop, the researchers said. They called the phenomenon Model Autophagy Disorder, or MAD, and likened it to mad cow disease, which spread when cattle were fed the infected remains of other cattle.
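The self-consuming loop can be illustrated with a toy simulation. The sketch below is not the study's experiment: it assumes the "model" is just a Gaussian refit each generation, and the function name `self_consuming_loop` and all parameter values are hypothetical choices made for illustration. Each generation trains on samples drawn from the previous generation's model, optionally mixed with fresh samples from the real distribution.

```python
import random
import statistics


def self_consuming_loop(real_fraction, generations=500, n=20, seed=0):
    """Refit a toy Gaussian model on its own samples plus fresh real data.

    Hypothetical illustration: the real data follows N(0, 1), and
    `real_fraction` is the share of each training set that is fresh real data.
    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(real_fraction * n)
    for _ in range(generations):
        # Synthetic samples come from the current model, not from reality.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # Fresh real samples always come from the true distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        # "Training" here is just a maximum-likelihood refit of the Gaussian.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma


if __name__ == "__main__":
    print("synthetic only :", self_consuming_loop(real_fraction=0.0))
    print("half fresh data:", self_consuming_loop(real_fraction=0.5))
```

Under these assumptions, the purely synthetic run should see its fitted spread shrink toward zero over the generations, while the run that keeps receiving fresh real data should retain a spread of roughly the original order. Real generative models are far more complex, so this only conveys the direction of the effect.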
Synthetic data ideally should be indistinguishable from real data, but since learning is never optimal, synthetic data has minor biases introduced by the choice of model, data and the learning regime, said Ilia Shumailov, a scientist at Google DeepMind. As a result, these errors are introduced into the overall data ecosystem and get used in training by other models as well, he told Information Security Media Group.
AI training data is traditionally characterized by volume, velocity and variety, but more recent definitions also include veracity and privacy. The absence of any of these elements compromises model performance. For AI-generated synthetic data, there are no commonly agreed-upon criteria for measuring any of them.
Even “real-world” data on the internet is now muddied by AI-generated content, making the situation more complex. Recent large models were trained on all of the web's data, but over the past two years, parts of the internet have been written by machine learning models, Shumailov said. This forms a feedback loop in which models first learn from the web as a data source and then write information back to it. A study he published with University of Oxford researchers shows that recursively trained models end up experiencing model collapse, in which the AI model becomes poisoned with its own projection of reality. “First, they experience variance reductions and later they experience error propagation and loss of utility,” Shumailov added.
The degradation of AI models trained on AI-generated content results in increasingly similar and less diverse outputs full of biases and errors.
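The loss of diversity can be sketched with another toy example, again under stated assumptions rather than as a reproduction of either study. Here the "model" is only a table of token frequencies, the starting vocabulary follows a Zipf-like distribution, and the function name `distinct_tokens_over_generations` and its parameters are hypothetical. Each generation writes a new corpus by sampling from the previous generation's frequencies, then relearns the frequencies from that corpus.

```python
import random
from collections import Counter


def distinct_tokens_over_generations(vocab_size=50, samples=100,
                                     generations=50, seed=0):
    """Track how many distinct tokens survive recursive resampling.

    Hypothetical illustration: a token that fails to appear in one
    generation's corpus has zero estimated probability afterwards, so it
    can never be generated again.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Zipf-like starting frequencies: a few common tokens, many rare ones.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    history = [vocab_size]
    for _ in range(generations):
        # The current model writes the next generation's training corpus.
        corpus = rng.choices(tokens, weights=weights, k=samples)
        # The next model learns token frequencies from that corpus only.
        counts = Counter(corpus)
        tokens = list(counts)
        weights = list(counts.values())
        history.append(len(tokens))
    return history


if __name__ == "__main__":
    print(distinct_tokens_over_generations())
```

In this setup the count of distinct tokens can only stay level or fall, and rare tokens tend to disappear first, which loosely mirrors the shrinking variety of outputs described above.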
Machine learning advancements tend to be attributed to improvements in hardware and higher-quality data, but when data quality declines, advancements in ML could slow down, Shumailov said.
“Synthetic data is an amazing tool if we get it right,” he said. The quality of the synthetic data used now introduces long-term errors, he said, adding a caveat that the impact would only be in a recursive training setup as described in his study. “I am certain that there exist other training regimes that would limit the impact of model collapse. The main question for the community now is to discover them,” he said.
