TechCrunch News, April 2
AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

The Wikimedia Foundation reports that bandwidth consumption on Wikimedia Commons has surged 50% since January 2024. The increase stems not from growing human demand for knowledge but from automated scrapers collecting data to train AI models, whose traffic places unprecedented strain and cost on the Foundation's servers. Wikimedia Commons is a free repository of image, video, and audio files; 65% of its most "expensive" traffic comes from bots, even though those bots account for only 35% of pageviews. To cope, the Foundation's team has had to spend considerable time and resources blocking scrapers so that ordinary users can still access the site. This reflects a growing threat to the open internet: AI scrapers ignore "robots.txt" files and drive up bandwidth demands. Developers are pushing back and some tech companies are taking their own countermeasures, but this cat-and-mouse game may ultimately force many publishers behind logins and paywalls, to the detriment of everyone who uses the web.

🖼️ The Wikimedia Foundation reports that bandwidth consumption on Wikimedia Commons has risen 50% since January 2024, driven mainly by automated scrapers gathering training data for AI models rather than by growth in human demand.

🤖 The data shows that roughly 65% of the "expensive" (i.e. resource-intensive) traffic on Wikimedia Commons comes from bots, yet those bots account for only 35% of pageviews. The gap arises because bots tend to request less-visited content, which is stored in the more distant core data center and is costlier to serve.

🛡️ To deal with the problems caused by AI scrapers, the Wikimedia Foundation's team is working to block them so that ordinary users can still access the site. This reflects a growing threat to the open internet, since AI scrapers ignore "robots.txt" files.

The Wikimedia Foundation, the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged by 50% since January 2024.

The reason, the outfit wrote in a blog post Tuesday, isn’t due to growing demand from knowledge-thirsty humans, but from automated, data-hungry scrapers looking to train AI models.

“Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs,” the post reads.

Wikimedia Commons is a freely accessible repository of images, videos and audio files that are available under open licenses or are otherwise in the public domain.

Digging down, Wikimedia says that almost two-thirds (65%) of the most “expensive” traffic — that is, the most resource-intensive in terms of the kind of content consumed — was from bots. However, just 35% of the overall pageviews come from these bots. The reason for this disparity, according to Wikimedia, is that frequently-accessed content stays closer to the user in its cache, while other less-frequently accessed content is stored further away in the “core data center,” which is more expensive to serve content from. This is the kind of content that bots typically go looking for.

“While human readers tend to focus on specific – often similar – topics, crawler bots tend to ‘bulk read’ larger numbers of pages and visit also the less popular pages,” Wikimedia writes. “This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.”
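To make the economics concrete, here is a minimal illustrative sketch in Python. The cache-hit rates and relative costs below are assumptions chosen for illustration, not figures from Wikimedia; the point is simply that a client with a low cache-hit rate can generate most of the expensive core-datacenter traffic while producing a minority of pageviews.

```python
# Illustrative toy model only. Hit rates and per-request costs are assumed
# values, not Wikimedia's real numbers.

EDGE_CACHE_HIT_RATE = {"human": 0.90, "bot": 0.30}          # assumed
COST_PER_REQUEST = {"edge_cache": 1, "core_datacenter": 10}  # assumed relative units

def serving_cost(requests: float, client: str) -> float:
    """Estimate the relative cost of serving `requests` for a client type."""
    hit_rate = EDGE_CACHE_HIT_RATE[client]
    hits = requests * hit_rate
    misses = requests * (1 - hit_rate)
    return hits * COST_PER_REQUEST["edge_cache"] + misses * COST_PER_REQUEST["core_datacenter"]

total_pageviews = 1_000_000
bot_share = 0.35  # bots' share of pageviews, per the article

human_cost = serving_cost(total_pageviews * (1 - bot_share), "human")
bot_cost = serving_cost(total_pageviews * bot_share, "bot")
print(f"human cost: {human_cost:,.0f}")
print(f"bot cost:   {bot_cost:,.0f}  ({bot_cost / (human_cost + bot_cost):.0%} of total)")
```

With these assumed numbers, bots at 35% of pageviews end up responsible for roughly two-thirds of the total serving cost, which mirrors the disparity Wikimedia describes.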

The long and short of all this is that the Wikimedia Foundation’s site reliability team is having to spend a lot of time and resources blocking crawlers to avert disruption for regular users. And all this before we consider the cloud costs that the Foundation is faced with.

In truth, this represents part of a fast-growing trend that is threatening the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault bemoaned the fact that AI crawlers ignore “robots.txt” files that are designed to ward off automated traffic. And “pragmatic engineer” Gergely Orosz also complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.
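For reference, robots.txt compliance is entirely voluntary: a well-behaved crawler checks the file before fetching anything, as in the sketch below, which uses Python's standard urllib.robotparser (the "ExampleBot/1.0" user agent and the target URL are made-up examples). Scrapers that ignore the file simply skip this step, which is why robots.txt offers no real protection on its own.

```python
# Sketch of what a *compliant* crawler does before fetching a page.
# "ExampleBot/1.0" is a hypothetical user agent; the problem described
# above is scrapers that never perform this check at all.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://commons.wikimedia.org/robots.txt")
rp.read()  # download and parse the site's robots.txt rules

url = "https://commons.wikimedia.org/wiki/Special:Random"
if rp.can_fetch("ExampleBot/1.0", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```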

While open source infrastructure, in particular, is in the firing line, developers are fighting back with “cleverness and vengeance,” as TechCrunch wrote last week. Some tech companies are doing their bit to address the issue, too — Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down.

However, it’s very much a cat-and-mouse game that could ultimately force many publishers to duck for cover behind logins and paywalls — to the detriment of everyone who uses the web today.

Tags: Wikipedia, AI scraping, bandwidth consumption, open internet