TechCrunch News, March 21
Microsoft is exploring a way to credit contributors to AI training data

Microsoft is launching a research project to assess how specific training data influences the outputs of generative AI models. The project will attempt to demonstrate that models can be trained so that the impact of particular data, such as photos and books, on their outputs can be efficiently estimated. It focuses on the intellectual property (IP) disputes surrounding AI-generated content: by tracing data provenance, it could recognize data contributors and explore compensation mechanisms for data owners. The project may be inspired by Jaron Lanier’s concept of “data dignity,” and it comes as Microsoft faces legal challenges from copyright holders and as the AI industry argues over “fair use.”

🧐 Microsoft is researching how to estimate the influence of specific training data on the text, images, and other outputs of generative AI models.

⚖️ The research is motivated by intellectual property (IP) lawsuits over AI-generated content; many companies train models on large amounts of public data, some of which is copyrighted.

💡 The project aims to trace data provenance, recognize data contributors, and potentially pay them, an idea tied to Jaron Lanier’s concept of “data dignity.”

💰 Some companies, including Bria, Adobe, and Shutterstock, already try to compensate data owners based on contributors’ “overall influence.”

🤔 Microsoft’s move may be a response to legal challenges from copyright holders and to the industry’s “fair use” debate, for example OpenAI’s call for the U.S. government to codify fair use for model training.

Microsoft is launching a research project to estimate the influence of specific training examples on the text, images, and other types of media that generative AI models create.

That’s per a job listing dating back to December that was recently recirculated on LinkedIn.

According to the listing, which seeks a research intern, the project will attempt to demonstrate that models can be trained in such a way that the impact of particular data — e.g. photos and books — on their outputs can be “efficiently and usefully estimated.”
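The listing doesn’t say which technique Microsoft has in mind. One published family of methods that fits the description is gradient-based training-data influence, such as TracIn (Pruthi et al., 2020), which scores a training example by how well its loss gradient aligns with the gradient of a given test output across training checkpoints. A minimal PyTorch sketch of that idea, purely illustrative and not Microsoft’s method:

```python
import torch

def tracin_influence(model, loss_fn, train_example, test_example, checkpoints, lrs):
    """TracIn-style influence estimate: sum over saved training checkpoints of
    lr * <grad of train-example loss, grad of test-example loss>.

    checkpoints: list of model state_dicts saved during training
    lrs:         learning rate in effect at each checkpoint
    """
    influence = 0.0
    for state, lr in zip(checkpoints, lrs):
        model.load_state_dict(state)
        params = [p for p in model.parameters() if p.requires_grad]

        x_tr, y_tr = train_example
        g_train = torch.autograd.grad(loss_fn(model(x_tr), y_tr), params)

        x_te, y_te = test_example
        g_test = torch.autograd.grad(loss_fn(model(x_te), y_te), params)

        # A positive dot product means this example pushed the model toward this output.
        influence += lr * sum((gt * ge).sum() for gt, ge in zip(g_train, g_test)).item()
    return influence
```

A positive score suggests the training example made the output more likely; summing across checkpoints approximates its cumulative effect over training. Doing this efficiently at the scale of a frontier model is the hard part, which is presumably what the research project is about.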

“Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this,” reads the listing. “[One is,] incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we will want in the future, assuming the future will surprise us fundamentally.”

AI-powered text, code, image, video, and song generators are at the center of a number of IP lawsuits against AI companies. Frequently, these companies train their models on massive amounts of data from public websites, some of which is copyrighted. Many of the companies argue that fair use doctrine shields their data-scraping and training practices. But creatives — from artists to programmers to authors — largely disagree.

Microsoft itself is facing at least two legal challenges from copyright holders.

The New York Times sued the tech giant and its sometime collaborator, OpenAI, in December, accusing the two companies of infringing on The Times’ copyright by deploying models trained on millions of its articles. Several software developers have also filed suit against Microsoft, claiming that the firm’s GitHub Copilot AI coding assistant was unlawfully trained using their protected works.

Microsoft’s new research effort, which the listing describes as “training-time provenance,” reportedly has the involvement of Jaron Lanier, the accomplished technologist and interdisciplinary scientist at Microsoft Research. In an April 2023 op-ed in The New Yorker, Lanier wrote about the concept of “data dignity,” which to him meant connecting “digital stuff” with “the humans who want to be known for having made it.”

“A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output,” Lanier wrote. “For instance, if you ask a model for ‘an animated movie of my kids in an oil-painting world of talking cats on an adventure,’ then certain key oil painters, cat portraitists, voice actors, and writers — or their estates — might be calculated to have been uniquely essential to the creation of the new masterpiece. They would be acknowledged and motivated. They might even get paid.”
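Once per-example influence scores exist, the step Lanier describes reduces to aggregating them by contributor and surfacing the most essential people. A hypothetical sketch (the contributor names, example IDs, and scores below are invented for illustration):

```python
from collections import defaultdict

def top_contributors(example_influence, example_owner, k=3):
    """Roll per-training-example influence scores for one model output up to
    the contributors who supplied the examples, and return the top k.

    example_influence: {example_id: influence score for this output}
    example_owner:     {example_id: contributor name}
    """
    by_owner = defaultdict(float)
    for ex_id, score in example_influence.items():
        by_owner[example_owner[ex_id]] += max(score, 0.0)  # ignore negative influence
    return sorted(by_owner.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Invented scores for one generated image:
scores = {"img_001": 0.42, "book_017": 0.31, "img_002": 0.07}
owners = {"img_001": "oil painter A", "book_017": "writer C", "img_002": "cat portraitist B"}
print(top_contributors(scores, owners))
# [('oil painter A', 0.42), ('writer C', 0.31), ('cat portraitist B', 0.07)]
```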

There are, not for nothing, already several companies attempting this. AI model developer Bria, which recently raised $40 million in venture capital, claims to “programmatically” compensate data owners according to their “overall influence.” Adobe and Shutterstock also award regular payouts to dataset contributors, although the exact payout amounts tend to be opaque.
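None of these companies publishes its exact formula, and the payout amounts, as noted, tend to be opaque. The simplest plausible scheme, assumed here purely for illustration, is a pro-rata split of a revenue pool over aggregate influence scores:

```python
def split_pool(pool_amount, contributor_influence):
    """Divide a payout pool pro-rata to each contributor's aggregate influence.

    contributor_influence: {name: non-negative influence score}
    """
    total = sum(contributor_influence.values())
    if total == 0:
        return {name: 0.0 for name in contributor_influence}
    return {name: pool_amount * score / total
            for name, score in contributor_influence.items()}

print(split_pool(1000.0, {"oil painter A": 0.42, "writer C": 0.31, "cat portraitist B": 0.07}))
# {'oil painter A': 525.0, 'writer C': 387.5, 'cat portraitist B': 87.5}
```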

Few large labs have established individual contributor payout programs outside of inking licensing agreements with publishers, platforms, and data brokers. They’ve instead provided means for copyright holders to “opt out” of training. But some of these opt-out processes are onerous, and only apply to future models, not previously trained ones.

Of course, Microsoft’s project may amount to little more than a proof of concept. There’s precedent for that. Back in May, OpenAI said it was developing similar technology that would let creators specify how they want their works to be included in, or excluded from, training data. But nearly a year later, the tool has yet to see the light of day, and it reportedly hasn’t been treated as a priority internally.

Microsoft may also be trying to “ethics wash” here, or to head off regulatory and court decisions that could disrupt its AI business.

Still, the fact that the company is investigating ways to trace training data is notable in light of other AI labs’ recently expressed stances on fair use. Several top labs, including Google and OpenAI, have published policy documents recommending that the Trump Administration weaken copyright protections as they relate to AI development. OpenAI has explicitly called on the U.S. government to codify fair use for model training, arguing that this would free developers from burdensome restrictions.

Microsoft didn’t immediately respond to a request for comment.
