TechCrunch News · April 19, 05:16
OpenAI’s new reasoning AI models hallucinate more

OpenAI's newly released o3 and o4-mini models are state-of-the-art in many respects, but they hallucinate, i.e., fabricate information, more than earlier models. The new models perform better in areas such as coding and math, yet are more prone to hallucination: on the PersonQA benchmark, o3 and o4-mini hallucinated on 33% and 48% of questions respectively, far higher than previous models. Although OpenAI is working on the problem, hallucination remains a major challenge for the accuracy and reliability of AI models.

🤔 OpenAI's o3 and o4-mini models show improved reasoning ability, performing especially well on coding and math tasks, but they are also more prone to hallucination.

📈 OpenAI's internal testing shows that o3 and o4-mini hallucinate far more often on the PersonQA benchmark than earlier models: o3's hallucination rate is 33%, while o4-mini's reaches 48%.

💡 Researchers speculate that the increase in hallucinations may be linked to the reinforcement learning used for the o-series models, which may amplify issues normally mitigated by standard post-training pipelines.

🔗 Hallucinations can lead to inaccurate output, such as broken website links, making these models hard for businesses to adopt in accuracy-critical settings such as law.

🔍 One possible remedy is to give models web search capabilities; for example, OpenAI's GPT-4o with web search reaches 90% accuracy on SimpleQA.

OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up — in fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today’s best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn’t seem to be the case for o3 and o4-mini.

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.

Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they “make more claims overall,” they’re often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That’s roughly double the hallucination rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA — hallucinating 48% of the time.
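For readers unfamiliar with how such a figure is produced, the sketch below shows, in rough terms, how a hallucination rate on a QA benchmark like PersonQA can be computed once each answer has been graded. The `GradedAnswer` structure, the grade labels, and the toy data are hypothetical stand-ins, not OpenAI's actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    answer: str
    grade: str  # "correct", "hallucinated", or "abstained" -- hypothetical labels

def hallucination_rate(results: list[GradedAnswer]) -> float:
    """Fraction of graded questions on which the model gave a hallucinated answer."""
    if not results:
        return 0.0
    hallucinated = sum(1 for r in results if r.grade == "hallucinated")
    return hallucinated / len(results)

# Toy run: 1 hallucinated answer out of 3 graded questions -> ~33%,
# the same headline figure reported for o3 on PersonQA.
sample = [
    GradedAnswer("Q1", "…", "correct"),
    GradedAnswer("Q2", "…", "correct"),
    GradedAnswer("Q3", "…", "hallucinated"),
]
print(f"hallucination rate: {hallucination_rate(sample):.0%}")
```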

Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT,” then copied the numbers into its answer. While o3 has access to some tools, it can’t do that.

“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to TechCrunch.

Sarah Schwettmann, co-founder of Transluce, added that o3’s hallucination rate may make it less useful than it otherwise would be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in their coding workflows, and that they’ve found it to be a step above the competition. However, Katanforoosh says that o3 tends to hallucinate broken website links. The model will supply a link that, when clicked, doesn’t work.

Hallucinations may help models arrive at interesting ideas and be creative in their “thinking,” but they also make some models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn’t be pleased with a model that inserts lots of factual errors into client contracts.

One promising approach to boosting the accuracy of models is giving them web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA. Potentially, search could improve reasoning models’ hallucination rates, as well — at least in cases where users are willing to expose prompts to a third-party search provider.
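As a rough illustration of why search grounding helps, the sketch below pairs a model call with retrieved snippets so the answer can lean on fetched text rather than purely on parametric memory. Both `search_web` and `ask_model` are hypothetical placeholders for whatever search provider and model API an application actually uses; this is not OpenAI's implementation of web search.

```python
def search_web(query: str, k: int = 3) -> list[str]:
    """Hypothetical search hook: return top-k text snippets for a query.
    A real system would call an actual search provider here."""
    return [f"[stub snippet {i + 1} for: {query}]" for i in range(k)]

def ask_model(prompt: str) -> str:
    """Hypothetical model hook standing in for an LLM API call."""
    return "[stub answer grounded only in the supplied snippets]"

def answer_with_search(question: str) -> str:
    # Retrieve supporting text first, then instruct the model to answer
    # only from that text -- the basic idea behind using web search
    # to curb hallucinations.
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)

print(answer_with_search("Who co-founded Transluce?"))
```

The trade-off the article notes still applies: whatever service backs `search_web` sees the user's question, so prompts are exposed to a third-party search provider.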

If scaling up reasoning models indeed continues to worsen hallucinations, it’ll make the hunt for a solution all the more urgent.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” said OpenAI spokesperson Niko Felix in an email to TechCrunch.

In the last year, the broader AI industry has pivoted to focus on reasoning models after techniques to improve traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it seems reasoning also may lead to more hallucinating — presenting a challenge.
