Bluesky's Open API Means Anyone Can Scrape Your Data for AI Training. It's All Public
2 décembre 2024 à 12:34
Bluesky says it will never train generative AI on its users' data. But despite that, "one million public Bluesky posts — complete with identifying user information — were crawled and then uploaded to AI company Hugging Face," reports Mashable (citing an article by 404 Media).
"Shortly after the article's publication, the dataset was removed from Hugging Face," the article notes, with the scraper at Hugging Face posting an apology. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." But TechCrunch noted the incident's real lesson. "Bluesky's open API means anyone can scrape your data for AI training," calling it a timely reminder that everything you post on Bluesky is public.
Bluesky might not be training AI systems on user content as other social networks are doing, but there's little stopping third parties from doing so...
Bluesky said that it's looking at ways to enable users to communicate their consent preferences externally, [but] the company posted: "Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!"
Mashable notes Bluesky's response to 404Media — that Bluesky is like a website, and "Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here."
So "While many commentators said that data collection should be opt in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use," according to SiliconRepublic.com.
Read more of this story at Slashdot.