MarkTechPost@AI · January 21
Step Towards Best Practices for Open Datasets for LLM Training

 

Large language models depend on open datasets for training but face legal, technical, and ethical challenges. Existing approaches lack a clear legal framework, and incomplete metadata makes copyright status hard to verify. Access to digitized public-domain material is restricted, and volunteer-driven projects lack governance, exposing contributors to legal risk. To address these problems, researchers propose a framework that emphasizes openly licensed and public-domain data, reliable metadata, cross-domain collaboration, and diversity of data sources. The framework also covers practical steps for sourcing, processing, and governing data, including tools to detect openly licensed content, integrated metadata standards, community collaboration, and transparent, reproducible preprocessing, all aimed at reducing legal risk and building a more open AI ecosystem.

⚖️ Legal challenges: Open datasets used for LLM training face uncertainty under copyright law and shifting regulation; there is no global standard or centralized database for validating and licensing datasets, and incomplete metadata makes it difficult to assess a work's legal status.

⚙️ Technical barriers: Access to digitized public-domain material is restricted; projects such as Google Books limit reuse, hindering the construction of open datasets, and volunteer-driven projects lack structured governance, exposing contributors to legal risk.

🌍 Ethical issues: Existing open-dataset practices do not adequately address social challenges such as diversity and accountability; they often exclude underrepresented languages and viewpoints, producing unequal data representation and concentrating power in a few organizations.

🛠️ Solution: Researchers propose a framework that emphasizes openly licensed and public-domain data, overcomes technical challenges such as ensuring reliable metadata and digitizing physical records, and promotes cross-domain collaboration to responsibly curate and release datasets.

🤝 Practical steps: The framework includes practical steps for sourcing, processing, and governing data, including tools to detect openly licensed content, integrated metadata standards, community collaboration, and transparent, reproducible preprocessing to reduce legal risk.

Large language models rely heavily on open datasets for training, which raises significant legal, technical, and ethical challenges in managing such datasets. The legal implications of using data are uncertain because copyright laws vary across jurisdictions and regulations on safe usage keep changing. The absence of global standards or centralized databases for validating and licensing datasets, combined with incomplete or inconsistent metadata, makes it difficult to assess the legal status of works. Technical barriers also limit access to digitized public-domain material. Most open datasets are ungoverned and offer no legal safety net for their contributors, exposing them to risk and making the projects hard to scale. While such efforts aim at transparency and collaboration, they do little to engage broader social challenges such as diversity and accountability, and they often exclude underrepresented languages and viewpoints.

Current methods of building open datasets for LLMs often lack clear legal frameworks and face significant technical, operational, and ethical challenges. Traditional approaches depend on incomplete metadata, which complicates verifying copyright status and compliance across regions with different laws. Digitizing public-domain materials and making them accessible is difficult because large projects such as Google Books restrict usage, which prevents the construction of open datasets. Volunteer-driven projects lack structured governance, exposing contributors to legal risk. These gaps prevent equal access, reduce diversity in data representation, and concentrate power in a few dominant organizations. The result is an ecosystem in which open datasets struggle to compete with proprietary models, reducing accountability and slowing progress toward transparent and inclusive AI development.

To mitigate issues in metadata encoding, data sourcing, and processing for machine-learning datasets, the researchers proposed a framework for building a reliable corpus of openly licensed and public-domain data for training large language models (LLMs). The framework emphasizes overcoming technical challenges such as ensuring reliable metadata and digitizing physical records. It calls for cross-domain cooperation to responsibly curate, govern, and release these datasets while fostering competition in the LLM ecosystem. It also stresses metadata standards, reproducibility for accountability, and diversity of data sources as an alternative to traditional methods that lack structured governance and transparency.
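To make the "reliable metadata" idea concrete, here is a minimal sketch of what a per-document provenance record could look like. The paper does not publish a schema; the field names and the SPDX-style license identifier are assumptions chosen for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass
class DatasetRecord:
    """Illustrative provenance metadata for one training document.

    All field names are hypothetical, not a published standard.
    """
    source_url: str      # where the document was obtained
    license_id: str      # SPDX identifier, e.g. "CC0-1.0" or "CC-BY-4.0"
    rights_status: str   # "public-domain" | "openly-licensed" | "unknown"
    digitized_from: str  # physical archive or collection of origin, if any

record = DatasetRecord(
    source_url="https://example.org/book/123",
    license_id="CC0-1.0",
    rights_status="public-domain",
    digitized_from="example national library scan",
)

# Records like this serialize to plain dicts, so they can be stored
# alongside each document and audited or filtered later.
print(asdict(record)["license_id"])  # → CC0-1.0
```

Keeping the rights status and origin machine-readable is what would later allow automated legal-status checks across jurisdictions.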

The researchers laid out practical steps for sourcing, processing, and governing datasets. Tools for detecting openly licensed content are used to ensure high-quality data. The framework integrates standards for metadata consistency, emphasizes digitization, and encourages collaboration with the communities that create datasets. It also supports transparency and reproducibility in preprocessing and addresses potential biases and harmful content, yielding a robust and inclusive system for training LLMs while reducing legal risk. The framework further highlights engaging underrepresented communities to build diverse datasets and creating clearer, machine-readable terms of use. Finally, to make the open-data ecosystem sustainable, the authors propose funding models combining public funding with support from tech companies and cultural institutions to ensure sustained participation.
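The license-detection step above can be sketched as a simple allow-list filter over per-document metadata. This is not the paper's tooling; the `license_id` field and the particular set of open licenses are assumptions for illustration.

```python
# Hypothetical filter: keep only documents whose metadata carries an
# SPDX identifier from an allow-list of open licenses.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT"}

def is_openly_licensed(doc: dict) -> bool:
    """Return True if the document's recorded license is on the allow-list."""
    return doc.get("license_id") in OPEN_LICENSES

corpus = [
    {"id": 1, "license_id": "CC0-1.0"},
    {"id": 2, "license_id": "all-rights-reserved"},
    {"id": 3, "license_id": "CC-BY-4.0"},
]

kept = [d["id"] for d in corpus if is_openly_licensed(d)]
print(kept)  # → [1, 3]
```

In practice, detection is the hard part: real pipelines must first infer or verify the license (from page markup, registries, or rights statements) before a filter like this can be trusted, which is why the framework pairs such tooling with metadata standards.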

Finally, the researchers outlined a broad plan for approaching these issues in the context of training LLMs on unlicensed data, focusing on the openness of datasets and the efforts required across different sectors. Initiatives such as metadata standardization, improved digitization, and responsible governance are intended to make the AI ecosystem more open. This work lays a foundation for future research into dataset management, AI governance, and technologies that improve data accessibility while addressing ethical and legal challenges.


Check out the Paper. All credit for this research goes to the researchers of this project.


