But Accurate Data Doesn’t Always Result In Accurate Outcomes
The U.K. data protection agency says generative artificial intelligence developers should take steps to filter out inaccurate training data if their models disseminate information about people.
In a public consultation launched earlier this month, the British regulator said it is probing the link between a model’s purpose and “accuracy” in two senses of the word: the statistical accuracy of the model itself and the correctness of personal data contained in IT systems.
How accurate a model must be depends on its use, the U.K. Information Commissioner’s Office said: A model used to make decisions about people should have high statistical accuracy and contain accurate data about the people involved.
The consultation says developers should eschew data from untrusted sources and filter out inaccurate data that resides even in trusted sources. “Developers need to set out clear expectations for users, whether individuals or organizations, on the accuracy of the output. They should also carry out research on whether users are interacting with the model in a way which is consistent with those expectations,” the consultation reads.
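The consultation does not prescribe a filtering mechanism. As a minimal sketch of the idea, assuming each training record carried hypothetical source and verification metadata, a data pipeline could drop records from untrusted sources and records flagged as inaccurate before training:

```python
# Minimal sketch of source-based filtering for a training corpus.
# The "source" and "verified" fields and the allowlist below are hypothetical;
# real pipelines would rely on whatever provenance and accuracy signals
# their datasets actually carry.

TRUSTED_SOURCES = {"official-register", "curated-encyclopedia"}  # hypothetical allowlist

def filter_training_records(records):
    """Keep only records from trusted sources that passed an accuracy check."""
    kept = []
    for record in records:
        if record.get("source") not in TRUSTED_SOURCES:
            continue  # eschew data from untrusted sources
        if not record.get("verified", False):
            continue  # drop data flagged as inaccurate, even within trusted sources
        kept.append(record)
    return kept

if __name__ == "__main__":
    corpus = [
        {"text": "Jane Doe was born in 1970.", "source": "official-register", "verified": True},
        {"text": "Jane Doe is 300 years old.", "source": "web-forum", "verified": False},
    ]
    print(filter_training_records(corpus))  # only the verified, trusted record remains
```

The allowlist and the verification flag here are illustrative assumptions, not requirements drawn from the consultation itself.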
It also said developers should provide clear information about a model’s statistical accuracy and consider retraining it based on user experience to improve results.
Data used to train models isn’t always selected for its accuracy, the office said. In particular, data from social media and online forums with high levels of engagement can be used to train AI to generate similarly engaging responses of its own.
“We are keen to hear from organizations about how to assess, measure and document the relationship between inaccurate training data and inaccurate model outputs,” the ICO wrote.
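The consultation does not say how that relationship should be measured. One hedged way to illustrate it is to inject varying amounts of label noise into training data and record the model’s statistical accuracy on a clean held-out set; the sketch below uses scikit-learn and synthetic data purely as an example, not as the ICO’s method:

```python
# Illustrative sketch (not from the ICO consultation): measure how injecting
# label noise into training data degrades statistical accuracy on a clean
# held-out set, and document the relationship as a simple table.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise_rate in (0.0, 0.1, 0.2, 0.4):
    y_noisy = y_train.copy()
    flip = rng.random(len(y_noisy)) < noise_rate  # corrupt a fraction of training labels
    y_noisy[flip] = 1 - y_noisy[flip]
    model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"training-label noise {noise_rate:.0%} -> test accuracy {acc:.3f}")
```

Recording the resulting noise-versus-accuracy table is one simple way an organization could document how inaccurate training data feeds through to inaccurate outputs.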
Insisting on accurate training data may not translate into accurate outcomes, said Johanna Walker, an AI researcher at King’s College London. The output of a generative AI system depends on how users prompt it – meaning a system can generate inaccurate outcomes despite being trained on accurate data.
Statistical accuracy can be measured while an AI system is being developed or in the beta stage. “Once you get things out in the wild, and people are going to be asking a whole bunch of random questions to it – because it is probabilistic – it is probable that you’re going to get a wide range of answers,” Walker said.
“The problem with accuracy is that it depends very much on what you’re trying to do and how accurate that needs to be,” Walker said, adding that the only way to help an AI system distinguish between accurate and inaccurate data is by testing the models more frequently.
Walker said she supports the ICO’s approach, outlined in its consultation, of matching use cases to monitoring. The ICO said it believes organizations “would need to carefully consider and ensure the model is not used by people in a way which is inappropriate for the level of accuracy that the developer knows it to have.”
The ICO’s focus on accuracy in data protection terms serves as a marker highlighting that privacy law requires processors to take reasonable steps to correct or erase incorrect data, said Joe Jones, director of research at the International Association of Privacy Professionals.
“If you believe something is wrong or inaccurate, then a person can ask for the data controller to correct that, or if something is inaccurate about you, you may, in some cases, have the right to have that data deleted and removed,” Jones said. “This can be further helpful in mitigating biases in data,” he added.
The ICO’s efforts align with the British government’s overall AI regulation strategy, which relies on existing authorities to monitor AI within their jurisdictions.
This consultation, the third in a series, follows the agency’s earlier consultations evaluating the legality of processing personally identifiable information within data scraped from public datasets, as well as a consultation calling for restrictions on the processing of sensitive data (see: UK Privacy Watchdog Probes Gen AI Privacy Concerns).