TechCrunch News · June 7, 01:41
EleutherAI releases massive AI training dataset of licensed and open domain text

EleutherAI has released a large open text dataset called The Common Pile v0.1, intended to give AI model training a compliant data source. Built over roughly two years in collaboration with several AI companies and academic institutions, the dataset contains 8 TB of text. EleutherAI claims that the Comma v0.1-1T and Comma v0.1-2T models trained on it perform on par with models trained on unlicensed data. The release responds to the lawsuits and transparency problems AI companies face over training on copyrighted data, and aims to make AI research more open and accessible.

📚EleutherAI has released The Common Pile v0.1, a dataset of licensed and open-domain text for training AI models. It took roughly two years to build, in collaboration with AI companies such as Poolside and Hugging Face as well as several academic institutions.

⚖️The Common Pile v0.1 is 8 TB in size and was used to train EleutherAI's Comma v0.1-1T and Comma v0.1-2T models. EleutherAI claims these models perform on par with models trained on unlicensed, copyrighted data.

💡The release responds to the lawsuits and transparency problems AI companies face over training on copyrighted data. EleutherAI argues that these lawsuits have reduced companies' transparency and harmed the broader AI research field.

📖The Common Pile v0.1 draws on sources including 300,000 public-domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio content.

🚀EleutherAI says Comma v0.1-1T and Comma v0.1-2T show that a carefully curated open dataset lets developers build models competitive with proprietary alternatives. On benchmarks for coding, image understanding, and math, the models rival Meta's Llama AI models.

EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called The Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, The Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, that EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open-source speech-to-text model, to transcribe audio content.
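
Since the collection is distributed through Hugging Face, here is a minimal sketch of what sampling it and transcribing audio with Whisper might look like, assuming the `datasets` and `openai-whisper` Python packages; the dataset repo id, field name, and audio file path are illustrative assumptions, not details confirmed by the article.

```python
from itertools import islice

import whisper                     # pip install openai-whisper
from datasets import load_dataset  # pip install datasets

# Stream a subset so the 8 TB collection is not downloaded in full.
# "common-pile/common_pile_v0.1" is an assumed repo id; check Hugging Face for the real one.
ds = load_dataset(
    "common-pile/common_pile_v0.1",
    split="train",
    streaming=True,
)
for record in islice(ds, 3):
    print(record.get("text", "")[:200])  # assumes each record has a "text" field

# Transcribe a local audio file with OpenAI's open-source Whisper model,
# the same tool EleutherAI says it used for audio sources.
model = whisper.load_model("base")
result = model.transcribe("lecture.mp3")  # hypothetical file path
print(result["text"][:200])
```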

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.


Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
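
As a rough illustration of that definition, a model's parameter count can be read straight from its weights; the sketch below uses the small public GPT-2 checkpoint via Hugging Face Transformers as a stand-in, since the Comma models themselves weigh in at roughly 7 billion parameters.

```python
from transformers import AutoModel  # pip install transformers torch

# Count the learned weights ("parameters") of a small public model.
model = AutoModel.from_pretrained("gpt2")              # stand-in checkpoint, not Comma v0.1
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")             # ~124M for gpt2; Comma v0.1 is ~7B
```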

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire — and legal pressure — for using The Pile to train models.

EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.
