Mashable, March 24
One company's devious plan to stop AI web scrapers from stealing your content

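Cloudflare has launched a new strategy for countering data scraping by AI companies: building an "AI labyrinth." These companies scrape the web to collect data for training their chatbots. Having observed that AI crawlers ignore websites' robots.txt instructions, Cloudflare designed a trap that lures misbehaving bots into a maze of fake web pages filled with AI-generated content, wasting their computational resources. The approach does more than punish rule-breaking: exposing AI models to AI-generated content leads to "model collapse," which degrades the models' quality. Cloudflare customers can currently opt in to the AI labyrinth to protect their content from web scraping.

🤖 AI companies scrape the web to collect data for training their chatbots, ignoring site rules such as the robots.txt protocol.

🕸️ Cloudflare builds an "AI labyrinth" to counter misbehaving AI crawlers. The labyrinth consists of fake web pages filled with AI-generated content and is meant to waste the crawlers' computational resources.

🤯 Training an AI model on AI-generated content leads to "model collapse," which degrades the model's quality. Cloudflare uses this effect to punish rule-breaking.

🛡️ Cloudflare customers can opt in to the AI labyrinth to protect their content from web scraping.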

AI is stealing your content. We know this is how AI companies have built their highly valued businesses: by scraping the web and using your data to train their chatbots.

Web scraping isn't new. In the past, websites could rely on simple protocols like robots.txt to define what could, and could not, be used by web crawlers. Those guidelines were respected by the companies doing the scraping to, say, build results for search engines. AI companies, however, are not abiding by this social contract and are ignoring those instructions.

Cloudflare, a global network service that helps some of the biggest websites in the world deliver content to users, has devised a new plan to deal with AI companies' web scrapers. And the idea is as positively devious as it is ingenious. 

In a new blog post, Cloudflare has shared how it's now "trapping misbehaving bots in an AI labyrinth." Basically, bots that don't follow the rules laid out for them via protocols such as robots.txt, a simple text file that lays out what web crawlers are allowed to do on a site, will be messed with in order to waste the time and resources of the company in charge of the bot.
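To make the robots.txt idea concrete, here is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot names `ExampleAIBot` and `SearchBot`, and the example.com URLs are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is banned from the whole
# site, while every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SearchBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, is_allowed(agent, url))
```

Nothing in the protocol enforces these answers. A crawler only obeys if its operator chooses to run a check like this, which is why a bot that skips it can be treated as misbehaving.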

"AI-generated content has exploded…at the same time, we’ve also seen an explosion of new crawlers used by AI companies to scrape data for model training," Cloudflare said in its post. "AI Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see."

Cloudflare says it previously just blocked AI web crawlers and scrapers. However, doing so alerted those behind the bots that their access had been denied, and as a result they would shift strategies in order to continue their scraping campaigns.

So, Cloudflare came up with an idea to build a honeypot: a series of fake webpages created with AI-generated content.

Cloudflare isn't using AI-generated content to fight AI web scrapers just for schadenfreude. When an AI model is trained on AI-generated content, the model itself degrades. The industry even has a term for it: "model collapse." Cloudflare is essentially making sure that bots that break the rules are punished for doing so.

Cloudflare's post gets into the technical details of building the AI labyrinth. The gist is that Cloudflare designed it so that a human visitor should never see these AI-generated honeypot pages. And even if one did, a human would notice the "AI-generated nonsense" on them. Bots, however, would fall down the rabbit hole, wasting computational resources as they go deeper and deeper through the multiple pages of AI-generated content.
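The rabbit-hole effect can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Cloudflare's implementation: the `/decoy/...` paths, the `decoy_links` and `crawl` helpers, the fan-out of three links per page, and the fetch budget are all invented for illustration. Each decoy page links to further decoy pages, so a crawler that follows the hidden entry link keeps spending requests until its budget runs out, while a compliant crawler never enters.

```python
import hashlib
from collections import deque


def decoy_links(path: str, fanout: int = 3) -> list[str]:
    """Derive child decoy paths deterministically from the current path.

    Hypothetical scheme: slices of a SHA-256 digest become new page names,
    so the maze can be generated on demand without storing any pages.
    """
    digest = hashlib.sha256(path.encode()).hexdigest()
    return [f"/decoy/{digest[i * 8:(i + 1) * 8]}" for i in range(fanout)]


def crawl(start: str, respects_rules: bool, budget: int = 50) -> int:
    """Return how many decoy pages a crawler fetches before it stops."""
    if respects_rules:
        # A compliant bot never follows the hidden, disallowed entry link.
        return 0
    queue, seen, fetched = deque([start]), set(), 0
    while queue and fetched < budget:
        page = queue.popleft()
        if page in seen:
            continue
        seen.add(page)
        fetched += 1                      # one wasted request
        queue.extend(decoy_links(page))   # every page leads to more pages
    return fetched


if __name__ == "__main__":
    print("compliant bot:", crawl("/decoy/entry", respects_rules=True))
    print("misbehaving bot:", crawl("/decoy/entry", respects_rules=False))
```

In this toy model the misbehaving crawler always exhausts its whole budget, because the queue of unvisited decoy pages grows faster than it can be drained.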

Cloudflare customers can opt in to the AI labyrinth right now to protect their content from web scrapers.
