MarkTechPost@AI, September 3, 2024
Re-LAION 5B Dataset Released: Improving Safety and Transparency in Web-Scale Datasets for Foundation Model Research Through Rigorous Content Filtering

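LAION has released the Re-LAION 5B dataset, which fixes the safety problems of the original dataset and aims for legal compliance. The content filtering was carried out in cooperation with several organizations, and the release is meant to advance machine learning research.

🎯 LAION is dedicated to advancing machine learning research by developing open and transparent datasets. The original LAION-5B dataset played an important role in training and evaluating foundation models but carried a risk of containing illegal content. In December 2023, a Stanford Internet Observatory report identified the problem, and LAION withdrew the dataset.

💪 Re-LAION 5B is the result of a comprehensive safety revision. Working with several key partners, LAION used hash values to identify and remove 2,236 suspect links, including the 1,008 found earlier, while avoiding legal and ethical problems. Two versions are available.

🚀 LAION’s commitment to safety and transparency goes further: it has published the updated dataset’s metadata so that third parties can clean derivative datasets, and it offers a collaboration model and resource recommendations for organizations that build web-scale datasets.

📣 LAION encourages researchers and organizations to migrate to the updated version and stresses adopting best practices for handling potential safety issues, so that research is based on safe, compliant datasets.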

LAION, a prominent non-profit organization dedicated to advancing machine learning research by developing open and transparent datasets, has recently released Re-LAION 5B. This updated version of the LAION-5B dataset marks a milestone in the organization’s ongoing efforts to ensure the safety and legal compliance of web-scale datasets used in foundation model research. The new dataset addresses critical issues related to potential illegal content, notably Child Sexual Abuse Material (CSAM), that were identified in the original LAION-5B.

Background and Motivation

The original LAION-5B dataset, released in 2022, was a web-scale dataset pairing text with links to images, and it has been instrumental in training and evaluating foundation models. These models, which improve their performance as they scale in terms of data, model size, and computational resources, are crucial for advancing the field of machine learning. However, the vastness and openness of the internet, from which the data was sourced, presented significant challenges in ensuring that the dataset was entirely free of illegal content.

In December 2023, the Stanford Internet Observatory published a report, led by researcher David Thiel, identifying 1,008 links within the LAION-5B dataset that potentially pointed to CSAM. This discovery prompted LAION to take immediate action, temporarily withdrawing the dataset from public access. The findings underscored the limitations of the filtering mechanisms originally employed by LAION despite the organization’s best efforts to exclude such material.

The Re-LAION 5B Update

Re-LAION 5B represents the culmination of a comprehensive safety revision process carried out in collaboration with several key partners, including the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory. These organizations provided LAION with lists of MD5 and SHA hashes corresponding to known CSAM and other illegal content. By leveraging these hashes, LAION was able to systematically identify and remove 2,236 suspect links from the dataset. This total includes the 1,008 links initially identified by the Stanford Internet Observatory.
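
The hash matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LAION’s actual pipeline: it assumes each metadata record already carries a precomputed MD5 digest in a hypothetical `md5` field and that the partner list is a plain collection of hex digests. The function name `remove_flagged` and the sample URLs are invented for the example. Because only digests are compared, no image is downloaded or viewed.

```python
import hashlib


def remove_flagged(records, blocked_digests):
    """Drop records whose digest is on the block list.

    records: iterable of dicts with a hypothetical "md5" key (hex string or None).
    blocked_digests: iterable of hex digests supplied by a partner organization.
    Returns (kept_records, number_removed).
    """
    # Normalize case and whitespace so "ABC..." and "abc..." compare equal.
    blocked = {d.strip().lower() for d in blocked_digests}
    kept, removed = [], 0
    for record in records:
        digest = record.get("md5")
        if digest is not None and digest.strip().lower() in blocked:
            removed += 1  # matched a listed hash: exclude the link
        else:
            kept.append(record)
    return kept, removed


# Placeholder byte strings stand in for image files; real metadata would
# already contain the digests, so nothing has to be fetched or opened.
sample = [
    {"url": "https://example.org/a.jpg", "md5": hashlib.md5(b"image-a").hexdigest()},
    {"url": "https://example.org/b.jpg", "md5": hashlib.md5(b"image-b").hexdigest()},
    {"url": "https://example.org/c.jpg", "md5": None},  # digest missing in metadata
]
blocked_list = [hashlib.md5(b"image-b").hexdigest().upper()]

kept, removed = remove_flagged(sample, blocked_list)
print(removed, [r["url"] for r in kept])
```

Records without a digest are kept in this sketch; a real pipeline would need a separate policy for them, for example matching on the URL instead.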

Importantly, the filtering process used to create Re-LAION 5B allowed LAION to remove potentially illegal content without its researchers having to access or inspect that content directly, thereby avoiding legal and ethical pitfalls. The updated dataset, now free of links to suspected CSAM, is available in two versions: Re-LAION-5B research and Re-LAION-5B research-safe. The former retains a higher threshold for potentially sensitive content, while the latter version further filters out the majority of Not Safe For Work (NSFW) material.

Ensuring Ongoing Safety and Compliance

LAION’s commitment to safety and transparency extends beyond the release of Re-LAION 5B. The organization has made the metadata from the updated dataset available to third parties, enabling them to clean their derivatives of LAION-5B by applying similar filtering techniques. This approach enhances the safety of derivative datasets and preserves the usability of LAION-5B as a reference dataset for ongoing research.
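
One way a third party might use the released metadata to clean a derivative dataset is to keep only the rows whose link still appears in the updated release. The sketch below assumes both datasets are CSV files sharing a hypothetical `url` column; the actual Re-LAION metadata layout may differ, and `clean_derivative` is an illustrative name, not a LAION tool.

```python
import csv
import io


def clean_derivative(derivative_csv, updated_metadata_csv, key="url"):
    """Return derivative rows whose key still appears in the updated metadata.

    Both arguments are open text file objects holding CSV data with a header row.
    The column name in `key` is an assumption made for this example.
    """
    # Collect every link that survived the safety revision.
    allowed = {row[key] for row in csv.DictReader(updated_metadata_csv)}
    # Keep only derivative rows that point at a surviving link.
    return [row for row in csv.DictReader(derivative_csv) if row[key] in allowed]


# In-memory stand-ins for the two files (contents are invented).
updated = io.StringIO(
    "url,caption\n"
    "https://example.org/a.jpg,a cat\n"
    "https://example.org/c.jpg,a tree\n"
)
derivative = io.StringIO(
    "url,score\n"
    "https://example.org/a.jpg,0.9\n"
    "https://example.org/b.jpg,0.8\n"
)

cleaned = clean_derivative(derivative, updated)
print([row["url"] for row in cleaned])
```

For a dataset with billions of rows, holding every allowed link in an in-memory set is unlikely to be practical; a join in a database or columnar engine would be the more realistic choice.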

The release of Re-LAION 5B also sets a new standard for safety in creating web-scale datasets. By partnering with expert organizations like IWF and C3P, LAION has demonstrated the importance of collaboration in addressing the challenges posed by the huge volume of often unregulated content on the public web. This collaborative approach offers a model for other organizations engaged in similar work, highlighting the value of shared expertise and resources in ensuring the safety and integrity of research datasets.

A Call to Action for the Research Community

In light of the improvements made in Re-LAION 5B, LAION strongly encourages all researchers and organizations still using the original LAION-5B dataset to migrate to the updated version. By doing so, they can ensure that their work is based on a dataset that has been thoroughly vetted for safety and legal compliance. LAION also recommends that organizations creating datasets from public web data partner with entities like IWF and C3P to obtain hash lists and other resources necessary for effective filtering.

LAION’s experience underscores the need for the broader research community to adopt and adhere to best practices for handling potential safety issues. This includes timely and direct communication of findings and proactive measures to address risks associated with large-scale web-derived datasets.

Conclusion

Re-LAION 5B is a significant step forward in LAION’s mission to provide open, transparent, and safe datasets for the machine learning research community. By addressing the issues identified in the original LAION-5B dataset and setting a new standard for safety in web-scale datasets, LAION has reaffirmed its commitment to advancing the field of ML responsibly and ethically. As researchers and professionals continue to explore the potential of foundation models, datasets like Re-LAION 5B will play an important role in ensuring that this work is conducted on a solid and safe foundation.


The post Re-LAION 5B Dataset Released: Improving Safety and Transparency in Web-Scale Datasets for Foundation Model Research Through Rigorous Content Filtering appeared first on MarkTechPost.
