TechCrunch News, March 21
Microsoft is exploring a way to credit contributors to AI training data

Microsoft is launching a research project to assess how specific training data influences the outputs of generative AI models. The project will attempt to demonstrate that models can be trained so that the impact of particular data, such as photos and books, on their outputs can be efficiently estimated. It focuses on the intellectual property (IP) disputes surrounding AI-generated content: by tracing data provenance, it could recognize data contributors and explore compensation mechanisms for data owners. The project may be inspired by Jaron Lanier’s concept of “data dignity,” and it comes as Microsoft faces legal challenges from copyright holders and as the AI industry argues over “fair use.”

🧐 Microsoft is researching how to estimate the influence of specific training data on the text, images, and other outputs of generative AI models.

⚖️ The research is motivated by intellectual property (IP) lawsuits over AI-generated content; many companies train models on large amounts of public data, some of which is copyrighted.

💡 The project aims to trace data provenance, recognize data contributors, and potentially pay them, an idea tied to Jaron Lanier’s concept of “data dignity.”

💰 Some companies, including Bria, Adobe, and Shutterstock, already try to compensate data owners based on contributors’ “overall influence.”

🤔 Microsoft’s move may be a response to legal challenges from copyright holders and to the industry’s “fair use” debate, for example OpenAI’s call for the U.S. government to codify fair use for model training.

Microsoft is launching a research project to estimate the influence of specific training examples on the text, images, and other types of media that generative AI models create.

That’s per a job listing dating back to December that was recently recirculated on LinkedIn.

According to the listing, which seeks a research intern, the project will attempt to demonstrate that models can be trained in such a way that the impact of particular data — e.g. photos and books — on their outputs can be “efficiently and usefully estimated.”
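The listing doesn’t say which technique Microsoft has in mind. One published family of methods that fits the description is gradient-based training-data influence, such as TracIn (Pruthi et al., 2020), which scores a training example by how well its loss gradient aligns with the gradient of a given test output across training checkpoints. A minimal PyTorch sketch of that idea, purely illustrative and not Microsoft’s method:

```python
import torch

def tracin_influence(model, loss_fn, train_example, test_example, checkpoints, lrs):
    """TracIn-style influence estimate: sum over saved training checkpoints of
    lr * <grad of train-example loss, grad of test-example loss>.

    checkpoints: list of model state_dicts saved during training
    lrs:         learning rate in effect at each checkpoint
    """
    influence = 0.0
    for state, lr in zip(checkpoints, lrs):
        model.load_state_dict(state)
        params = [p for p in model.parameters() if p.requires_grad]

        x_tr, y_tr = train_example
        g_train = torch.autograd.grad(loss_fn(model(x_tr), y_tr), params)

        x_te, y_te = test_example
        g_test = torch.autograd.grad(loss_fn(model(x_te), y_te), params)

        # A positive dot product means this example pushed the model toward this output.
        influence += lr * sum((gt * ge).sum() for gt, ge in zip(g_train, g_test)).item()
    return influence
```

A positive score suggests the training example made the output more likely; summing across checkpoints approximates its cumulative effect over training. Doing this efficiently at the scale of a frontier model is the hard part, which is presumably what the research project is about.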

“Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this,” reads the listing. “[One is,] incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we will want in the future, assuming the future will surprise us fundamentally.”

AI-powered text, code, image, video, and song generators are at the center of a number of IP lawsuits against AI companies. Frequently, these companies train their models on massive amounts of data from public websites, some of which is copyrighted. Many of the companies argue that fair use doctrine shields their data-scraping and training practices. But creatives — from artists to programmers to authors — largely disagree.

Microsoft itself is facing at least two legal challenges from copyright holders.

The New York Times sued the tech giant and its sometime collaborator, OpenAI, in December, accusing the two companies of infringing on The Times’ copyright by deploying models trained on millions of its articles. Several software developers have also filed suit against Microsoft, claiming that the firm’s GitHub Copilot AI coding assistant was unlawfully trained using their protected works.

Microsoft’s new research effort, which the listing describes as “training-time provenance,” reportedly has the involvement of Jaron Lanier, the accomplished technologist and interdisciplinary scientist at Microsoft Research. In an April 2023 op-ed in The New Yorker, Lanier wrote about the concept of “data dignity,” which to him meant connecting “digital stuff” with “the humans who want to be known for having made it.”

“A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output,” Lanier wrote. “For instance, if you ask a model for ‘an animated movie of my kids in an oil-painting world of talking cats on an adventure,’ then certain key oil painters, cat portraitists, voice actors, and writers — or their estates — might be calculated to have been uniquely essential to the creation of the new masterpiece. They would be acknowledged and motivated. They might even get paid.”
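Once per-example influence scores exist, the step Lanier describes reduces to aggregating them by contributor and surfacing the most essential people. A hypothetical sketch (the contributor names, example IDs, and scores below are invented for illustration):

```python
from collections import defaultdict

def top_contributors(example_influence, example_owner, k=3):
    """Roll per-training-example influence scores for one model output up to
    the contributors who supplied the examples, and return the top k.

    example_influence: {example_id: influence score for this output}
    example_owner:     {example_id: contributor name}
    """
    by_owner = defaultdict(float)
    for ex_id, score in example_influence.items():
        by_owner[example_owner[ex_id]] += max(score, 0.0)  # ignore negative influence
    return sorted(by_owner.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Invented scores for one generated image:
scores = {"img_001": 0.42, "book_017": 0.31, "img_002": 0.07}
owners = {"img_001": "oil painter A", "book_017": "writer C", "img_002": "cat portraitist B"}
print(top_contributors(scores, owners))
# [('oil painter A', 0.42), ('writer C', 0.31), ('cat portraitist B', 0.07)]
```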

There are, not for nothing, already several companies attempting this. AI model developer Bria, which recently raised $40 million in venture capital, claims to “programmatically” compensate data owners according to their “overall influence.” Adobe and Shutterstock also award regular payouts to dataset contributors, although the exact payout amounts tend to be opaque.
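None of these companies publishes its exact formula, and the payout amounts, as noted, tend to be opaque. The simplest plausible scheme, assumed here purely for illustration, is a pro-rata split of a revenue pool over aggregate influence scores:

```python
def split_pool(pool_amount, contributor_influence):
    """Divide a payout pool pro-rata to each contributor's aggregate influence.

    contributor_influence: {name: non-negative influence score}
    """
    total = sum(contributor_influence.values())
    if total == 0:
        return {name: 0.0 for name in contributor_influence}
    return {name: pool_amount * score / total
            for name, score in contributor_influence.items()}

print(split_pool(1000.0, {"oil painter A": 0.42, "writer C": 0.31, "cat portraitist B": 0.07}))
# {'oil painter A': 525.0, 'writer C': 387.5, 'cat portraitist B': 87.5}
```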

Few large labs have established individual contributor payout programs outside of inking licensing agreements with publishers, platforms, and data brokers. They’ve instead provided means for copyright holders to “opt out” of training. But some of these opt-out processes are onerous, and only apply to future models, not previously trained ones.

Of course, Microsoft’s project may amount to little more than a proof of concept. There’s precedent for that. Back in May, OpenAI said it was developing similar technology that would let creators specify how they want their works to be included in, or excluded from, training data. But nearly a year later, the tool has yet to see the light of day, and it reportedly hasn’t been treated as a priority internally.

Microsoft may also be trying to “ethics wash” here, or to head off regulatory and court decisions that could disrupt its AI business.

Still, the fact that the company is investigating ways to trace training data is notable in light of other AI labs’ recently expressed stances on fair use. Several top labs, including Google and OpenAI, have published policy documents recommending that the Trump Administration weaken copyright protections as they relate to AI development. OpenAI has explicitly called on the U.S. government to codify fair use for model training, arguing that this would free developers from burdensome restrictions.

Microsoft didn’t immediately respond to a request for comment.
