TechCrunch News, April 2, 04:17
Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

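A new report suggests that OpenAI may have used paywalled O'Reilly Media books without permission when training its advanced AI models. The report was published by the AI Disclosures Project, which used DE-COP, a method for detecting copyrighted content in language models' training data. It found that OpenAI's GPT-4o model recognized paywalled O'Reilly book content better than the company's earlier GPT-3.5 Turbo model. Although OpenAI has licensing agreements with news publishers and other organizations and offers opt-out mechanisms, the allegation adds to the controversy the company faces over its training data practices and its treatment of copyright law.

📚 The study found that OpenAI's GPT-4o model may have been trained on paywalled O'Reilly Media books for which OpenAI had no license.

🔍 The study used DE-COP, a method designed to detect copyrighted content in language models' training data by testing whether a model can distinguish human-authored text from paraphrased, AI-generated versions of the same text.

💡 GPT-4o recognized paywalled O'Reilly book content better than the earlier GPT-3.5 Turbo, which suggests GPT-4o is more likely to have been trained on these books.

⚖️ OpenAI faces several lawsuits over its training data practices and its treatment of copyright law, and this allegation adds to the controversy. OpenAI has not commented on the matter.

📝 OpenAI has signed licensing agreements with news publishers and other organizations, and offers opt-out mechanisms that let copyright owners flag content they do not want the company to use for training.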

OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn’t license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.

While a number of AI labs including OpenAI have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.

“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content […] compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models’ training data. A type of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
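The test described above can be sketched as a multiple-choice quiz: the verbatim passage is shuffled in with paraphrases, and the model is asked to pick the original. This is a minimal illustration, not the authors' implementation. The `Chooser` callable, the `memorizing_chooser` stand-in, and the sample sentences are all hypothetical; a real test would replace the chooser with a call to the model under study.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: takes the answer options and returns the index it
# believes is the verbatim passage. A real test would wrap a model query here.
Chooser = Callable[[List[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim passage in with its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[Tuple[str, Sequence[str]]],
               choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which the chooser picks the verbatim passage."""
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)


def chance_rate(num_paraphrases: int) -> float:
    """Guess rate expected from a chooser that picks uniformly at random."""
    return 1.0 / (num_paraphrases + 1)


# Toy stand-in for a model that has memorized one passage (illustrative only).
MEMORIZED = {"The quick brown fox jumps over the lazy dog."}


def memorizing_chooser(options: List[str]) -> int:
    for i, text in enumerate(options):
        if text in MEMORIZED:
            return i
    return 0  # no memorized passage present: fall back to the first option


items = [(
    "The quick brown fox jumps over the lazy dog.",
    ["A fast brown fox leaps over a sleepy dog.",
     "Over the lazy dog jumps a speedy brown fox.",
     "The lazy dog is jumped over by a quick fox."],
)]
rate = guess_rate(items, memorizing_chooser)
```

A guess rate well above `chance_rate(3)` across many passages would hint that the chooser has seen the originals before; a rate near chance would not.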

The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.
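Comparing books published before and after a training cutoff amounts to asking how well a per-excerpt score separates the two groups. The sketch below assumes AUROC as the separation statistic, which the article does not specify, and uses invented scores purely for illustration; none of the numbers come from the paper.

```python
from typing import Sequence


def auroc(in_period: Sequence[float], out_of_period: Sequence[float]) -> float:
    """Probability that a random in-period score beats a random out-of-period one.

    Ties count as half a win. 0.5 means no separation; 1.0 means perfect.
    """
    if not in_period or not out_of_period:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in in_period:
        for n in out_of_period:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(in_period) * len(out_of_period))


# Invented recognition scores (hypothetical, for illustration only).
before_cutoff = [0.8, 0.4]  # excerpts from books published before the cutoff
after_cutoff = [0.6, 0.2]   # excerpts from books published after the cutoff

separation = auroc(before_cutoff, after_cutoff)
```

Books published after the cutoff cannot have been in the training data, so they act as a control group: the further the statistic sits above 0.5, the more the model's recognition tracks whether a book existed before training ended.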

According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.

“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.

It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof, and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data, or were trained on a lesser amount than GPT-4o.

That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.

OpenAI didn’t respond to a request for comment.
