The Microsoft leak, which stemmed from AI researchers sharing open-source training data on GitHub, has been mitigated.
Microsoft has patched a vulnerability that exposed 38TB of private data from its AI research division. White hat hackers from cloud security company Wiz discovered a shareable link based on Azure Shared Access Signature (SAS) tokens on June 22, 2023. The hackers reported it to the Microsoft Security Response Center, which invalidated the SAS token by June 24 and replaced the token on the GitHub page, where it was originally located, on July 7.
SAS tokens, an Azure file-sharing feature, enabled this vulnerability
The hackers first discovered the vulnerability as they searched for misconfigured storage containers across the internet. Misconfigured storage containers are a known backdoor into cloud-hosted data. The hackers found robust-models-transfer, a repository of open-source code and AI models for image recognition used by Microsoft’s AI research division.
The vulnerability originated from a Shared Access Signature token for an internal storage account. A Microsoft employee shared a URL for a Blob store (a type of object storage in Azure) containing an AI dataset in a public GitHub repository while working on open-source AI learning models. From there, the Wiz team used the misconfigured URL to acquire permissions to access the entire storage account.
When the Wiz hackers followed the link, they were able to access a storage account that contained disk backups of two former employees’ workstation profiles and internal Microsoft Teams messages. The account held 38TB of data, including secrets, private keys, passwords and the open-source AI training data.
SAS tokens can be issued with very long lifetimes and are difficult to track or revoke once shared, so they aren’t typically recommended for sharing important data externally. A September 7 Microsoft security blog pointed out that “Attackers may create a high-privileged SAS token with long expiry to preserve valid credentials for a long period.”
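To make the expiry and privilege concerns concrete, here is a minimal Python sketch that reads the fields a SAS URL carries in its query string: sp (permissions), se (expiry) and srt (resource types, present on account-level tokens). The function name, the seven-day lifetime limit and the example URLs are illustrative assumptions, not details from the Wiz or Microsoft reports.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Hypothetical policy for this sketch: flag anything valid for more than 7 days.
MAX_LIFETIME = timedelta(days=7)
# Permission letters beyond plain read: add, create, delete, list, write.
BEYOND_READ = set("acdlw")


def audit_sas_url(url, now=None):
    """Return a list of warnings about the SAS token embedded in a URL."""
    now = now or datetime.now(timezone.utc)
    params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

    if "sig" not in params:
        return ["no sig parameter: this does not look like a SAS URL"]

    warnings = []
    # Account-level SAS tokens carry srt; service-level tokens carry sr instead.
    if "srt" in params:
        warnings.append("account-level SAS: scope is the whole storage account")

    extra = sorted(set(params.get("sp", "")) & BEYOND_READ)
    if extra:
        warnings.append("permissions beyond read: " + "".join(extra))

    expiry = params.get("se")
    if expiry is None:
        warnings.append("no se field: expiry may be set by a stored access policy")
    else:
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        if expires_at - now > MAX_LIFETIME:
            warnings.append(f"long-lived token: expires {expires_at.date()}")
    return warnings


if __name__ == "__main__":
    # Made-up URL for illustration only; the signature is not real.
    example = (
        "https://example.blob.core.windows.net/models"
        "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlac"
        "&se=2050-01-01T00:00:00Z&sig=REDACTED"
    )
    for warning in audit_sas_url(example):
        print(warning)
```

A check like this only reads what the URL itself declares. It cannot tell whether a token has already been revoked, and it does not contact Azure.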
Microsoft noted that no customer data was ever included in the information that was exposed, and that there was no risk of other Microsoft services being breached because of the AI data set.
What businesses can learn from the Microsoft data leak
The risk in this case isn’t specific to AI training; any very large open-source data set might conceivably be shared in this way. However, Wiz pointed out in its blog post, “Researchers collect and share massive amounts of external and internal data to construct the required training information for their AI models. This poses inherent security risks tied to high-scale data sharing.”
Wiz suggested organizations looking to avoid similar incidents should caution employees against oversharing data. In this case, the Microsoft researchers could have moved the public AI data set to a dedicated storage account.
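One possible guardrail against this kind of oversharing, offered here as an illustration rather than something Wiz or Microsoft prescribes, is to scan files for SAS URLs before they are pushed to a public repository. The Python sketch below assumes a SAS URL can be recognized by an Azure Storage hostname plus a sig= query parameter; the pattern and function name are hypothetical and may miss custom domains or tokens stored separately from the URL.

```python
import re

# Azure Storage URL whose query string carries a SAS signature (sig=...).
SAS_URL_PATTERN = re.compile(
    r"https://[a-z0-9]+\.(?:blob|file|queue|table|dfs)\.core\.windows\.net/"
    r"[^\s\"'<>]*[?&]sig=[^\s\"'<>&]+",
    re.IGNORECASE,
)


def find_sas_urls(text):
    """Return (line_number, matched_url) pairs for every SAS URL in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in SAS_URL_PATTERN.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


if __name__ == "__main__":
    # Made-up README content for illustration only.
    readme = (
        "Pretrained models\n"
        "Download: https://example.blob.core.windows.net/data/set.zip"
        "?sv=2020-08-04&sp=r&sig=abc123\n"
        "Docs: https://example.blob.core.windows.net/public/readme.txt\n"
    )
    for line_number, url in find_sas_urls(readme):
        print(f"line {line_number}: {url}")
```

Run as a pre-commit or CI step, a scan like this would surface the link for review; deciding whether the token's scope and lifetime are acceptable still needs a person or a further check.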
Organizations should be alert for supply chain attacks, which can occur if attackers inject malicious code into files that are open to public access through improper permissions.
“As we see wider adoption of AI models within companies, it’s important to raise awareness of relevant security risks at every step of the AI development process, and make sure the security team works closely with the data science and research teams to ensure proper guardrails are defined,” the Wiz team wrote in their blog post.
TechRepublic has reached out to Microsoft and Wiz for comment.
