We Built a Tool to Protect Your Dataset From Simple Scrapers

To address the problem of AI training data being scraped, this article introduces a command-line tool called easy-dataset-share. The tool helps users protect their datasets from less sophisticated scraping by deploying a download portal with basic protections. It can be deployed in little time at zero cost, providing a Turnstile-protected website and a CLI tool that adds robots.txt, a Terms of Service, canary strings, and hash verification. While it cannot stop advanced scrapers, easy-dataset-share is an important first step toward protecting datasets and preventing train-set leakage, helping preserve a dataset's long-term value and making the alignment of future models a little safer.

📊 **The harm of dataset contamination**: When benchmark data is included in AI training data, the AI may simply learn the answers, so the benchmark no longer measures generalization. More worryingly, if the data contains negative AI "stereotypes", they may become self-fulfilling, training future models to exhibit exactly those behaviors.

🛡️ **Introducing easy-dataset-share**: An open-source command-line tool for hardening datasets against less sophisticated scraping. It can deploy a download portal in under 30 minutes at zero cost, providing basic protections: a Turnstile-protected website and a CLI tool that adds robots.txt, a Terms of Service, canary strings (for detecting scraping), and hash verification.

⚠️ **Limitations of the tool**: easy-dataset-share cannot stop sophisticated scrapers; for example, scraping operations can bypass Cloudflare Turnstile for a small per-request fee. The robots.txt and Terms of Service are not technically binding and rely mainly on user good faith, although the ToS provides limited legal deterrence. Canary strings can also be stripped.

🚀 **Future improvements**: To counter more sophisticated scraping, future versions could integrate OAuth2 to deanonymize crawlers by requiring Google-verified accounts, and add a clickwrap Terms of Service that makes users check an "I agree" box, strengthening legal enforceability and deterrence.

🔒 **Why dataset protection matters**: The author urges readers to protect their own datasets, citing Anthropic's commitment in the Claude 4 system card to add canary strings to its transcripts. Using this tool to protect hosted data guards somewhat against train-set leakage, increases a dataset's long-term value, and buys a little peace of mind about the alignment of future models.

Published on July 25, 2025 5:44 AM GMT

Summary: We introduce a command-line tool for hardening datasets against less sophisticated scrapers.


Author: Alex Turner. Contributors: Dipika Khullar, Ed Turner, and Roy Rinberg.

Dataset contamination is bad for several reasons. Most obviously, when benchmarks are included in AI training data, those benchmarks no longer measure generalization -- the AI may have been directly taught the answers. Even more concerningly, if your data promote negative "stereotypes" about AIs, they might become self-fulfilling prophecies, training future models to exhibit those very behaviors.

In the Claude 4 system card, Anthropic revealed that approximately 250,000 transcripts from their alignment faking paper had been scraped from the public web and included in their pretraining data. This caused an early model to hallucinate details from the paper's fictional scenarios, forcing Anthropic to implement unique mitigations. Speculatively, this kind of misalignment data could degrade the alignment of any models trained thereafter.[1]

However, Anthropic's exploratory measurements (quoted in footnote [1]) wouldn't rule out the hypothesis that the alignment-faking transcripts degraded Claude's alignment before the mitigations were applied.

Data scraping practices are a serious problem. The tool we are currently releasing will not stop state-of-the-art actors. Since I wanted to at least mitigate the problem, I put out a bounty for a simple, open-source tool to harden data against scraping. The tool is now ready: easy-dataset-share. In less than 30 minutes and at a cost of $0, you can deploy a download portal with basic protections against scrapers, serving a canary-tagged dataset with modest protections against AI training.

Warning: easy-dataset-share will not stop sophisticated scrapers

Sophisticated scraping operations can bypass Cloudflare Turnstile for roughly $0.001 per solved challenge (via e.g. CapSolver). The robots.txt and Terms of Service are not technically binding and rely on the good faith of the user, although the ToS does provide limited legal deterrence. Canary strings can be stripped from documents. Overall, this tool is just a first step towards mitigating dataset contamination. We later discuss improvements which might protect against sophisticated actors.
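To see why robots.txt is good-faith only: it is just a plain-text file served at the site root, and nothing stops a crawler from ignoring it. A typical deny-all file looks like this:

```
User-agent: *
Disallow: /
```

Well-behaved crawlers respect this; a determined scraper simply doesn't.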

A download portal in minutes

We reduce the friction of serving data in a scraper-resistant fashion.

At most, users click a box. They don't have to complete any annoying tasks.
Generally, you'd just download a single .zip – possibly created by our easy-dataset-share command-line tool.

While you'll need to click some buttons on the GitHub, Vercel, and Cloudflare websites, our guide and data-share-vercel-setup command automate the tricky parts, like creating API keys and configuring environment variables.

What we provide

A web portal

The Turnstile-protected website stops low-effort automated scrapers before they can even see the files.
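For context, Turnstile issues a token in the visitor's browser, and the server must validate that token against Cloudflare's siteverify endpoint before serving anything. A minimal sketch of that server-side check (the endpoint and fields are Cloudflare's documented API; the surrounding Python is our own, not necessarily the portal's code):

```python
import requests

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"

def verify_turnstile(token: str, secret_key: str) -> bool:
    """Validate a Turnstile token before revealing the download link.

    `token` is posted by the Turnstile widget in the visitor's browser;
    `secret_key` is the server-side secret from the Cloudflare dashboard.
    """
    resp = requests.post(
        SITEVERIFY_URL,
        data={"secret": secret_key, "response": token},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("success", False)
```

Low-effort scrapers that never execute the widget's JavaScript never obtain a valid token, so they are turned away before seeing any files.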

A CLI tool

The underlying command-line tool[2] (easy-dataset-share) wraps the dataset in several ways:

- a robots.txt that instructs compliant crawlers to stay away;
- a Terms of Service that forbids unauthorized scraping and training;
- canary strings, so that leakage into a training corpus can later be detected;
- hash verification, so downloaders can confirm the files are intact.
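As a rough illustration of the last two items (our own sketch with hypothetical function names, not the tool's actual implementation):

```python
import hashlib
import uuid

def add_canary(text: str, dataset_name: str) -> tuple[str, str]:
    """Append a unique, searchable canary string to a document.

    If a model later reproduces the canary verbatim, that is strong
    evidence the document entered its training data.
    Returns (tagged_text, canary).
    """
    canary = f"CANARY-{dataset_name}-{uuid.uuid4().hex}"
    return f"{text}\n\n{canary}\n", canary

def file_sha256(path: str) -> str:
    """Hash a file so downloaders can verify an untampered copy."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```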

Possible improvements

Because of the pseudo-volunteer nature of this effort, we are releasing this tool with obvious improvements left on the table. We wanted to provide a v1.0 and perhaps invite further collaboration.

- Use OAuth2 to deanonymize crawlers, requiring them to use a Google-verified account on the record. We hope to force scrapers to overcome Google's sophisticated bot-detection apparatus in order to access the dataset.
- Include a clickwrap Terms of Service. Currently, a user can download the dataset without explicitly agreeing to the Terms of Service. We could require users to check a box stating "I accept the Terms of Service" before revealing the download link. Clickwrap agreements seem to be more legally enforceable and a stronger legal deterrent; a sketch of such a gate appears below.
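To make the clickwrap idea concrete, here is a minimal sketch of such a gate, assuming a hypothetical Flask endpoint (the portal itself deploys on Vercel; the route and field names here are ours):

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Hypothetical location of the protected dataset.
DATASET_URL = "https://example.com/dataset.zip"

@app.post("/download")
def download():
    # Clickwrap gate: only reveal the link if the user explicitly
    # checked the "I accept the Terms of Service" box in the form.
    if request.form.get("tos_accepted") != "on":
        abort(403, description="You must accept the Terms of Service.")
    # A real deployment would also record the acceptance (account, IP,
    # timestamp), since that evidentiary trail is what gives a
    # clickwrap agreement its legal teeth.
    return {"url": DATASET_URL}
```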

Think you see a better way to do things or just want to help out? Feel free to join our collaborative Discord or submit a pull request. If needed, Neel Nanda has volunteered to pay someone to work on this full-time until the project is done.[3]

Please protect datasets

After the alignment faking leakage, Anthropic took a positive step by committing[4] to add canary strings to their transcripts in the future. But rather than trusting AI labs to properly filter canary-tagged data, be proactive. If you host your own data, use this tool to put it behind a Turnstile. By taking these steps, you somewhat protect against train-set leakage, making your dataset more valuable in the long run. Plus, we can all rest a teeny bit easier about the alignment of future models. To get started, follow the README.


Thank you to the core contributors: Dipika Khullar, Ed Turner, and Roy Rinberg. They also maintain the repository. While I put out the original $500 bounty, I was then joined by Anna Wang ($500), James Aung ($500), and Girish Sastry ($1,000).

  1. ^

    Anthropic conducted measurements to test whether the alignment faking data had broader impacts:

    We conducted several small exploratory experiments to assess whether the use of this data influenced the model’s behavior more broadly, and now believe that this is very unlikely. For example, on multiple measures of alignment, post-mitigation snapshots of the model act no less aligned when prompted to use <SCRATCHPAD_REASONING> tags, rather than ordinary <antml:thinking> tags.

    — Claude 4 system card 

  2. ^

    When you just need to share a file directly, use easy-dataset-share to produce a single file that is safer than a standard .zip.

  3. ^

    The full-time opportunity is separate from the bounties, which have already been claimed by the current contributors.

  4. ^

    Anthropic committed to add canary strings on the bottom of page 38 of the Claude 4 system card.


