Mashable, November 28, 2024
One million public Bluesky posts scraped for AI training

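One million public Bluesky posts were scraped and uploaded to Hugging Face for use in developing language models and similar work. The move sparked controversy: Bluesky does not expressly prohibit it, but users never opted in. The article also notes that the platform's API makes content open to third-party developers. The dataset has since been removed from Hugging Face.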

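💬 One million public Bluesky posts were scraped and uploaded, including user information

🎯 The dataset was intended for developing language models and analyzing social media trends

🔍 Bluesky's API makes content open to third-party developers, raising concerns

🙅 Users did not consent to this use, and the dataset was later removed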

Bluesky is already facing its first major AI scrape, despite the stance of its owners that it will never train generative AI on user data.

As 404Media reported on Nov. 26, one million public Bluesky posts, complete with identifying user information, were crawled and then uploaded to AI company Hugging Face. The dataset was created by machine learning librarian Daniel van Strien and was intended for use in developing language models and natural language processing, as well as for general analysis of social media trends, content moderation, and posting patterns. It contains users' decentralized identifiers (DIDs) and even has a search function to find content from specific users.

According to the dataset's description, the set "contains 1 million public posts collected from Bluesky Social's firehose API (Application Programming Interface), intended for machine learning research and experimentation with social media data. Each post contains text content, metadata, and information about media attachments and reply relationships."

Bluesky users didn't opt in to such uses of their content, but Bluesky doesn't expressly prohibit them either. The platform's firehose API is an "aggregated, chronological stream of all the public data updates as they happen in the network, including posts, likes, follows, handle changes, and more." Bluesky's API, coupled with the public and decentralized Authenticated Transfer (AT) Protocol the site is built on, means Bluesky content is open and available to the third-party developers the platform is trying to court, 404Media explains.

This could be a major warning sign to many of the site's millions of new users, many of whom left competitor X in the wake of an alarming new AI training policy. A Bluesky representative responded to 404Media's requests for comment: "Bluesky is an open and public social network, much like websites on the Internet itself. Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here. We'd like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this and that outside orgs respect user consent, and we're actively discussing how to achieve this."

Shortly after the article's publication, the dataset was removed from Hugging Face. "I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake," van Strien wrote in a follow-up Bluesky post.
