Unite.AI December 21, 2024
Monetizing Research for AI Training: The Risks and Best Practices

As demand for generative AI grows, high-quality training data has become essential. Academic publishers have begun commercializing their research content to supply training data for large language models. While this creates a new revenue stream for publishers and advances the use of AI in scientific discovery, it also raises key questions about research integrity and reliability. This article examines the rise of monetized research data, the shadow of bogus research, and the potential impact on AI training and trust, and it proposes best practices for ensuring that AI is trained on trustworthy data, emphasizing the importance of joint effort by publishers, AI companies, and researchers.

💰 Academic publishers are earning substantial revenue by licensing their content to tech companies as training data for generative AI models; Wiley, for example, has earned more than $40 million from such deals this year.

⚠️ However, bogus research is a widespread problem in academia, and many published findings are flawed, biased, or unreliable. In 2023, for example, more than 10,000 papers were retracted for falsified or unreliable results, seriously threatening the quality and reliability of AI training data.

🧪 To reduce the disruption of bogus research to AI training, publishers need to improve their peer-review processes, AI companies need to choose their partners carefully, and AI developers need to take responsibility for their data and improve transparency. In addition, open access to high-quality research should be encouraged to ensure inclusivity and fairness in AI development.

As the demand for generative AI grows, so does the hunger for high-quality data to train these systems. Scholarly publishers have started to monetize their research content to provide training data for large language models (LLMs). While this development creates a new revenue stream for publishers and empowers generative AI for scientific discovery, it also raises critical questions about the integrity and reliability of the research involved: are the datasets being sold trustworthy, and what implications does this practice have for the scientific community and for generative AI models?

The Rise of Monetized Research Deals

Major academic publishers, including Wiley, Taylor & Francis, and others, have reported substantial revenues from licensing their content to tech companies developing generative AI models. For instance, Wiley revealed over $40 million in earnings from such deals this year alone. These agreements enable AI companies to access diverse and expansive scientific datasets, presumably improving the quality of their AI tools.

The pitch from publishers is straightforward: licensing ensures better AI models, benefiting society while rewarding authors with royalties. This business model benefits both tech companies and publishers. However, the growing trend of monetizing scientific knowledge carries risks, particularly when questionable research infiltrates these AI training datasets.

The Shadow of Bogus Research

The scholarly community is no stranger to issues of fraudulent research. Studies suggest many published findings are flawed, biased, or simply unreliable. A 2020 survey found that nearly half of researchers reported issues like selective data reporting or poorly designed field studies. In 2023, more than 10,000 papers were retracted due to falsified or unreliable results, a number that continues to climb annually. Experts believe this figure represents the tip of the iceberg, with countless dubious studies circulating in scientific databases.

The crisis has primarily been driven by “paper mills,” shadow organizations that produce fabricated studies, often in response to academic pressures in regions like China, India, and Eastern Europe. It’s estimated that around 2% of journal submissions globally come from paper mills. These sham papers can resemble legitimate research but are riddled with fictitious data and baseless conclusions. Disturbingly, such papers slip through peer review and end up in respected journals, compromising the reliability of scientific insights. For instance, during the COVID-19 pandemic, flawed studies on ivermectin falsely suggested its efficacy as a treatment, sowing confusion and delaying effective public health responses. This example highlights the potential harm of disseminating unreliable research, where flawed results can have a significant impact.

Consequences for AI Training and Trust

The implications are profound when LLMs train on databases containing fraudulent or low-quality research. AI models use patterns and relationships within their training data to generate outputs. If the input data is corrupted, the outputs may perpetuate inaccuracies or even amplify them. This risk is particularly high in fields like medicine, where incorrect AI-generated insights could have life-threatening consequences.
Moreover, the issue threatens the public's trust in academia and AI. As publishers continue to make agreements, they must address concerns about the quality of the data being sold. Failure to do so could harm the reputation of the scientific community and undermine AI’s potential societal benefits.

Ensuring Trustworthy Data for AI

Reducing the risks of flawed research disrupting AI training requires a joint effort from publishers, AI companies, developers, researchers and the broader community. Publishers must improve their peer-review process to catch unreliable studies before they make it into training datasets. Offering better rewards for reviewers and setting higher standards can help. An open review process is critical here. It brings more transparency and accountability, helping to build trust in the research.
AI companies must be more careful about who they work with when sourcing research for AI training. Choosing publishers and journals with a strong reputation for high-quality, well-reviewed research is key. In this context, it is worth looking closely at a publisher’s track record—like how often they retract papers or how open they are about their review process. Being selective improves the data’s reliability and builds trust across the AI and research communities.
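As a concrete illustration of this kind of vetting, the minimal sketch below screens a candidate corpus against a local list of retracted DOIs before any paper is admitted to a training set. The file name retracted_dois.csv, the corpus structure, and the field names are illustrative assumptions rather than any publisher's or vendor's actual pipeline; in practice, such a list might come from an export of a retraction database like Retraction Watch.

```python
# Minimal sketch: drop retracted papers from a candidate training corpus.
# Assumes a local one-column CSV of retracted DOIs (e.g., an export of a
# retraction database); file name and corpus fields are illustrative only.
import csv


def load_retracted_dois(path: str) -> set[str]:
    """Read retracted DOIs from a one-column CSV into a normalized set."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}


def screen_corpus(papers: list[dict], retracted: set[str]) -> list[dict]:
    """Keep only papers whose DOI does not appear on the retraction list."""
    clean = []
    for paper in papers:
        doi = paper.get("doi", "").strip().lower()
        if doi and doi not in retracted:
            clean.append(paper)
    return clean


if __name__ == "__main__":
    # In practice: retracted = load_retracted_dois("retracted_dois.csv")
    retracted = {"10.1000/example.002"}  # hypothetical DOI, for demonstration
    corpus = [
        {"doi": "10.1000/example.001", "title": "A sound study"},
        {"doi": "10.1000/example.002", "title": "A retracted study"},
    ]
    usable = screen_corpus(corpus, retracted)
    print(f"{len(usable)} of {len(corpus)} papers passed the retraction screen")
```

Retraction status is only one signal, of course; the same gate could also weigh a journal's overall retraction rate or the openness of its review process, as described above.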

AI developers need to take responsibility for the data they use. This means working with experts, carefully checking research, and comparing results from multiple studies. AI tools themselves can also be designed to identify suspicious data and reduce the risks of questionable research spreading further.
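One way such a screen might look is sketched below: it flags text containing “tortured phrases,” the garbled synonyms (for example, “counterfeit consciousness” in place of “artificial intelligence”) that integrity researchers treat as a warning sign of paraphrased or paper-mill content. The phrase list and the flag_suspicious helper are illustrative assumptions, not a vetted tool; a production filter would combine many signals and still route hits to human reviewers.

```python
# Illustrative heuristic: flag documents containing known "tortured phrases",
# garbled synonyms that often indicate automatically paraphrased text.
# The list below is a tiny sample for demonstration, not a curated resource.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
    "irregular woodland": "random forest",
    "bosom malignancy": "breast cancer",
}


def flag_suspicious(text: str) -> list[str]:
    """Return any tortured phrases found in the text."""
    lowered = text.lower()
    return [phrase for phrase in TORTURED_PHRASES if phrase in lowered]


abstract = "We combine profound learning with an irregular woodland classifier."
hits = flag_suspicious(abstract)
if hits:
    print("Flag for manual review:", ", ".join(hits))
```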

Transparency is also an essential factor. Publishers and AI companies should openly share details about how research is used and where royalties go. Tools like the Generative AI Licensing Agreement Tracker show promise but need broader adoption. Researchers should also have a say in how their work is used. Opt-in policies, like those from Cambridge University Press, offer authors control over their contributions. This builds trust, ensures fairness, and encourages authors to participate actively in the process.

Moreover, open access to high-quality research should be encouraged to ensure inclusivity and fairness in AI development. Governments, non-profits, and industry players can fund open-access initiatives, reducing reliance on commercial publishers for critical training datasets. On top of that, the AI industry needs clear rules for sourcing data ethically. By focusing on reliable, well-reviewed research, we can build better AI tools, protect scientific integrity, and maintain the public’s trust in science and technology.

The Bottom Line

Monetizing research for AI training presents both opportunities and challenges. While licensing academic content allows for the development of more powerful AI models, it also raises concerns about the integrity and reliability of the data used. Flawed research, including that from “paper mills,” can corrupt AI training datasets, leading to inaccuracies that may undermine public trust and the potential benefits of AI. To ensure AI models are built on trustworthy data, publishers, AI companies, and developers must work together to improve peer review processes, increase transparency, and prioritize high-quality, well-vetted research. By doing so, we can safeguard the future of AI and uphold the integrity of the scientific community.

