The Verge - Artificial Intelligence · July 25, 2024
Anthropic’s crawler is ignoring websites’ anti-AI scraping policies

Anthropic's ClaudeBot web crawler hit the website of repair company iFixit nearly a million times within 24 hours, apparently violating the latter's Terms of Use. iFixit's terms expressly prohibit reproducing, copying, or distributing site content without prior written permission, specifically including use for training machine learning or AI models. Although Anthropic responded that its crawler can be blocked via a robots.txt file, iFixit and other sites appear to have been scraped heavily by ClaudeBot.

🔍 ClaudeBot made nearly a million requests to iFixit in just 24 hours, behavior that likely violates iFixit's Terms of Use, which expressly prohibit using its content to train AI models without permission.

🚫 iFixit CEO Kyle Wiens points out that Anthropic is not only using their content without paying but also tying up their devops resources. He invited Anthropic to negotiate a license for commercial use of the content.

📚 When questioned, Anthropic pointed to an FAQ page stating that its crawler can be blocked via a robots.txt file. iFixit has since added a crawl-delay directive to its robots.txt.

🌐 Beyond iFixit, the Read the Docs co-founder and the founder of Freelancer.com report that their sites have also been scraped heavily by ClaudeBot, suggesting this behavior is not an isolated case.

🤖 Although robots.txt is the opt-out method of choice for many AI companies such as OpenAI, it gives site owners little flexibility to define what scraping is and is not permitted. Some companies, such as Perplexity, ignore robots.txt exclusion directives entirely.

Image: The Verge

The ClaudeBot web crawler that Anthropic uses to scrape training data for AI models like Claude has hammered iFixit’s website almost a million times in a 24-hour period, seemingly violating the repair company’s Terms of Use in the process.

“If any of those requests accessed our terms of service, they would have told you that use of our content expressly forbidden. But don’t ask me, ask Claude!” said iFixit CEO Kyle Wiens on X, posting images that show Anthropic’s chatbot acknowledging that iFixit’s content was off limits. “You’re not only taking our content without paying, you’re tying up our devops resources. If you want to have a conversation about licensing our content for commercial use, we’re right here.”

iFixit’s Terms of Use policy states that “reproducing, copying or distributing” any content from the website is “strictly prohibited without the express prior written permission” from the company, with specific inclusion of “training a machine learning or AI model.” When Anthropic was questioned on this by 404 Media, however, the AI company linked back to an FAQ page that says its crawler can only be blocked via a robots.txt file.

Wiens says iFixit has since added the crawl-delay directive to its robots.txt. We have asked Wiens and Anthropic for comment and will update this story if we hear back.
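iFixit hasn't published the exact rules it deployed, but a robots.txt entry targeting Anthropic's crawler would look roughly like the sketch below. The `ClaudeBot` user-agent token matches the crawler name Anthropic documents; the directives shown are illustrative, and note that Crawl-delay is a nonstandard extension that not every crawler honors.

```txt
# Throttle Anthropic's crawler: at most one request every 10 seconds
# (Crawl-delay is nonstandard; honoring it is up to the crawler)
User-agent: ClaudeBot
Crawl-delay: 10

# Alternatively, block the crawler from the entire site:
# User-agent: ClaudeBot
# Disallow: /
```

This coarse granularity is the flexibility problem discussed below: robots.txt can throttle or exclude a crawler by name and path, but it cannot express *why* access is denied or distinguish, say, search indexing from AI training.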

iFixit doesn’t seem to be alone, with Read the Docs co-founder Eric Holscher and Freelancer.com CEO Matt Barrie saying in Wiens’ thread that their sites had also been aggressively scraped by Anthropic’s crawler. This also doesn’t seem to be new behavior for ClaudeBot, with several months-old Reddit threads reporting a dramatic increase in Anthropic’s web scraping. In April this year, the Linux Mint web forum attributed a site outage to strain caused by ClaudeBot’s scraping activities.

Disallowing crawlers via robots.txt files is also the opt-out method of choice for many other AI companies like OpenAI, but it doesn’t provide website owners with any flexibility to denote what scraping is and isn’t permitted. Another AI company, Perplexity, has been known to ignore robots.txt exclusions entirely. Still, it is one of the few options available for companies to keep their data out of AI training materials, which Reddit has applied in its recent crackdown on web crawlers.
