TechCrunch News, November 21, 2024
OpenAI accidentally deleted potential evidence in NY Times copyright lawsuit

 

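The New York Times and Daily News are suing OpenAI, alleging that it scraped their works without permission to train its AI models. OpenAI engineers recently deleted, by accident, search data that may be relevant to the case, forcing the plaintiffs to restart their search work. OpenAI tried to recover the data, but the folder structure and file names were lost, so the recovered data cannot be used to determine whether the news publishers' articles were used to build OpenAI's models. The incident has drawn attention to OpenAI's data management and to copyright issues, and it highlights the legal challenges OpenAI faces in training AI models on public data. Although OpenAI maintains that training on publicly available data is fair use, it has signed licensing agreements with several publishers, which suggests it is trying to balance AI development with copyright protection.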

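🤔 OpenAI engineers accidentally deleted the data that The New York Times and Daily News had generated while searching OpenAI's training data sets for copyrighted content, forcing the plaintiffs to redo their searches.

📅 The incident occurred on November 14: OpenAI engineers deleted all the search data stored on one virtual machine and attempted to recover it, but the recovered data cannot be used to determine whether the news publishers' articles were used to build OpenAI's models.

⚠️ The plaintiffs' lawyers argue that OpenAI is best placed to search its data sets for potentially infringing content using its own tools, and say the incident highlights problems in OpenAI's data management.

⚖️ OpenAI argues that training AI models on publicly available data is fair use, but it has signed licensing agreements with several publishers, including The Associated Press, Axel Springer, and the Financial Times, to reduce its copyright risk.

💰 OpenAI's licensing deal with Dotdash Meredith is reportedly worth at least $16 million per year, but the specific terms have not been made public.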

Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case.

Earlier this fall, OpenAI agreed to provide two dedicated virtual machines so that counsel for The Times and Daily News could perform searches for copyrighted content in its training data sets. In a letter, attorneys for the publishers say that they and experts have spent over 150 hours since November 1 searching OpenAI’s training data.

But on November 14, OpenAI engineers erased all the publishers’ search data stored on one of the virtual machines, according to the aforementioned letter, which was filed in the U.S. District Court for the Southern District of New York late Thursday.

OpenAI tried to recover the data and was somewhat successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” per the letter.

“News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” counsel for The Times and Daily News wrote. “The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week’s worth of its experts’ and lawyers’ work must be re-done, which is why this supplemental letter is being filed today.”

The plaintiffs’ counsel makes clear that they have no reason to believe the deletion was intentional. But they do say the incident underscores that OpenAI “is in the best position to search its own datasets” for potentially infringing content using its own tools.

We’ve reached out to OpenAI for comment and will update this piece if we hear back.

In this case and others, OpenAI has maintained that training AI models using publicly available data — including articles from The Times and Daily News — is fair use. In other words, in creating models like GPT-4o, which “learn” from billions of examples of ebooks, essays, and more to generate human-sounding text, OpenAI believes that it isn’t required to license or otherwise pay for the examples — even if it makes money from those models.

That being said, OpenAI has inked licensing deals with a growing number of news publishers, including The Associated Press, Business Insider owner Axel Springer, Financial Times, People parent company Dotdash Meredith, and News Corp. OpenAI has declined to make the terms of these deals public, but one content partner, Dotdash, is reportedly being paid at least $16 million per year.
