Mashable, June 18
Meta's AI tool Llama almost entirely memorized Harry Potter book, study finds

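A new study finds that Meta's Llama model has memorized Harry Potter and the Sorcerer's Stone so thoroughly that it can reproduce verbatim excerpts from 42 percent of the book. The researchers analyzed a number of books from Books3, a dataset used to train Meta's Llama models that is also at issue in a copyright infringement lawsuit against Meta. They found that Llama 3.1 memorized some books almost entirely, but that memorization varies widely between models, between books, and even between different parts of the same book. The findings could have major implications for AI companies facing similar lawsuits, and they have prompted further debate over AI and copyright.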

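📚 The study found that Meta's Llama 3.1 model can reproduce verbatim excerpts from 42 percent of Harry Potter and the Sorcerer's Stone, indicating a high degree of memorization of the book.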

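⚖️ The Books3 dataset examined in the study is a collection of books used to train Meta's Llama models, and it is also at issue in a copyright infringement lawsuit against Meta. The findings could significantly affect AI companies' positions on copyright.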

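💡 Memorization varies from book to book: Llama 3.1 memorized Harry Potter far more thoroughly than some other books, such as Sandman Slim (an estimated 0.13 percent). One expert suspects this reflects how famous the book is and how often excerpts from it appear in the training data.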

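🤔 The same expert notes that AI companies can control how much their models memorize, so memorization is not an inevitable feature of AI. Companies can make choices during training that reduce or increase memorization of particular content.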

Meta's Llama model has memorized Harry Potter and the Sorcerer's Stone so well that it can reproduce verbatim excerpts from 42 percent of the book, according to a new study.

Researchers from Stanford, Cornell, and West Virginia University analyzed dozens of books from the now-infamous Books3 dataset, a collection of pirated books used to train Meta's Llama models. Books3 is also at the center of a copyright infringement lawsuit against Meta, Kadrey v. Meta Platforms, Inc. The study's authors say their findings could have major implications for AI companies facing similar lawsuits.

According to the research paper, the Llama 3.1 model "memorizes some books, like Harry Potter and 1984, almost entirely." Specifically, the study found that Llama 3.1 has memorized 42 percent of the first Harry Potter book so well that it can reproduce verbatim excerpts at least 50 percent of the time. Overall, Llama 3.1 could reproduce excerpts from 91 percent of the book, though not as consistently.
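One way to read the "42 percent" figure is as the share of excerpts that the model would reproduce verbatim with probability of at least 0.5. The sketch below is illustrative only. It assumes each excerpt comes with per-token log-probabilities from the model, and it uses the standard fact that an autoregressive model's probability of producing a whole sequence is the product of its per-token probabilities. The function names and the toy numbers are hypothetical, and the article does not describe the study's actual procedure.

```python
import math


def excerpt_probability(token_logprobs):
    """Probability of generating the whole excerpt verbatim.

    For an autoregressive model this is the product of the per-token
    probabilities, i.e. exp of the summed log-probabilities.
    """
    return math.exp(sum(token_logprobs))


def memorized_fraction(excerpts_logprobs, threshold=0.5):
    """Share of excerpts whose reproduction probability meets the threshold."""
    if not excerpts_logprobs:
        return 0.0
    hits = sum(
        1 for logprobs in excerpts_logprobs
        if excerpt_probability(logprobs) >= threshold
    )
    return hits / len(excerpts_logprobs)


# Made-up log-probabilities for three hypothetical excerpts (not study data).
toy_excerpts = [
    [-0.01, -0.02, -0.01],  # very likely to be reproduced
    [-0.5, -0.4, -0.3],     # unlikely to be reproduced
    [-0.1, -0.2, -0.1],     # more likely than not
]

strict = memorized_fraction(toy_excerpts, threshold=0.5)
loose = memorized_fraction(toy_excerpts, threshold=0.01)
print(f"share at threshold 0.5:  {strict:.2f}")
print(f"share at threshold 0.01: {loose:.2f}")
```

Lowering the threshold can only keep or raise the share. That matches the article's pattern of a smaller portion of the book being reproduced reliably and a larger portion being reproduced less consistently.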

"The extent of verbatim memorization of books from the Books3 dataset is more significant than previously described," said the paper. But the researchers also discovered that "memorization varies widely from model to model and from book to book within each model, as well as varying in different parts of individual books." For example, the study estimated that Llama 3.1 only memorized 0.13 percent of Sandman Slim by Richard Kadrey, one of the lead plaintiffs in the class action copyright suit against Meta.

So, while some of the paper's findings seem damning, don't call it a smoking gun for plaintiffs in AI copyright infringement cases.

"These results give everyone in the AI copyright debate something to latch on to," wrote journalist Timothy B. Lee in his Understanding AI newsletter. "Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits."

Why is Llama able to reproduce some books more than others? "I suspect that the difference is because Harry Potter is a much more famous book. It's widely quoted and I'm sure that substantial excerpts from it on third-party websites found their way into the training data on the web," said James Grimmelmann, a professor of digital and information law at Cornell University, who was cited in the paper.

What this also shows, Grimmelmann said, is that "AI companies can make choices that increase or reduce memorization. It's not an inevitable feature of AI; they have control over it."

We asked Meta for comment on the study's findings, and we'll update this article if we receive a response.


Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.
