Mashable, April 22, 00:44
Wikipedia has a solution for the deluge of AI training bots hogging its servers

 

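AI training has put enormous strain on Wikipedia's servers, so Wikipedia has partnered with Kaggle to provide data directly to AI developers. The dataset simplifies data access, and Wikimedia hopes the move will reduce the load that scrapers place on its site. The article also notes that AI companies' demand for data has sparked copyright disputes, whereas Wikipedia content is covered by a specific license.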

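🌐 AI training strains Wikipedia's servers, prompting a partnership

📄 A dataset of Wikipedia content in English and French is released

🚫 The aim is to reduce scraper load on the site and regularize data use

📜 Wikipedia content is covered by a specific license and can be used legally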

You're not the only one who turns to Wikipedia for quick facts. Lately, a deluge of AI bots training on Wikipedia articles has put enormous strain on the organization's servers.

To curb the influx of "non-human traffic" scraping the site for training data, Wikipedia is taking a proactive approach: serving up its data directly to AI developers.

On Wednesday, the Wikimedia Foundation announced a partnership with Google-owned Kaggle to release a beta dataset "featuring structured Wikipedia content in English and French." The dataset was uploaded on April 15, and the company said it "simplifies access to clean, pre-parsed article data that’s immediately usable for modeling, benchmarking, alignment, fine-tuning, and exploratory analysis."

According to Ars Technica, bots that scrape Wikipedia and Wikimedia Commons pages have consumed 50 percent of its bandwidth, putting a massive strain on the nonprofit's entire operation. Wikimedia hopes that serving up data to developers will dissuade them from deploying bots all over its pages.

The rise of generative AI has let loose a flood of scraping bots hungrily crawling all corners of the internet for more data. To compete against rivals, AI companies have a seemingly insatiable appetite for data. This has included copyrighted works, a contentious issue with artists. Authors, artists, and musicians are arguing in court that this training violates copyright law when it's done without credit, compensation, or consent.

That's why companies like Meta and OpenAI are currently embroiled in copyright infringement lawsuits brought by plaintiffs like the Authors Guild and The New York Times, who argue this practice is not protected by the fair use doctrine.

But the difference here is that all Wikipedia content is licensed under the Creative Commons Attribution-ShareAlike license, which means its content is free to use as long as it's properly attributed and distributed under the same license. The Wikimedia Foundation told Gizmodo that Kaggle paid for the data through Wikimedia Enterprise, and AI companies "are still expected to respect Wikipedia’s attribution and licensing terms."

The partnership between Wikimedia and Kaggle represents a more nuanced way forward, allowing AI companies to train models on internet data that's been obtained legally and, at the very least, more ethically.
