Mashable, April 22, 00:44
OpenAI's o3 and o4-mini hallucinate far more than previous models

This article looks at the hallucination problem in OpenAI's newly released reasoning models, o3 and o4-mini, which showed higher hallucination rates than earlier models in the company's internal testing. According to the report, o3's hallucination rate is 33 percent and o4-mini's reaches 48 percent, far above o1's 16 percent. Although OpenAI says these models have improved reasoning ability, the rise in hallucination rates has raised concerns about their accuracy and reliability. The article also compares hallucination rates across other models and points out the complexity of evaluation benchmarks, noting that results can differ considerably under different testing methods.

🤔 OpenAI's latest reasoning models, o3 and o4-mini, showed high hallucination rates on the PersonQA evaluation: 33 percent for o3 and 48 percent for o4-mini, far above the 16 percent of the o1 model.

🧐 OpenAI explains that the o3 model tends to make more claims overall, which yields more accurate claims but also more inaccurate or hallucinated ones. The company has not identified the underlying cause and says more research is needed.

💡 Although OpenAI's reasoning models are designed to improve accuracy by thinking more deeply, their higher hallucination rates compared with non-reasoning models such as GPT-4.5 and GPT-4o raise questions about their reliability, even though models tend to be more accurate when they can draw on web search for their answers.

By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more often than o1.

As first reported by TechCrunch, OpenAI's system card details the results of the PersonQA evaluation, which is designed to test for hallucinations. According to those results, o3's hallucination rate is 33 percent, and o4-mini's hallucination rate is 48 percent — almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.

The system card noted that o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, saying only, "More research is needed to understand the cause of this result."

OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."

However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.

Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even how they evaluate models.

Plus, different benchmarks rely on different methods to test accuracy and hallucinations. Hugging Face's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents and finds much lower hallucination rates across the board for major models on the market than OpenAI's evaluations do. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning 0.8 percent. It's worth noting that o3 and o4-mini weren't included in the current leaderboard.
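For context, a summary-based hallucination rate of this kind is usually just the fraction of generated summaries that a judge (human or model) flags as containing claims unsupported by the source document. The snippet below is a minimal sketch of that arithmetic with toy data; the `is_hallucinated` judgment is a hypothetical placeholder, not the actual evaluation code behind either leaderboard.

```python
# Minimal sketch: a hallucination rate is the share of generated summaries
# judged to contain claims not supported by the source document.

def is_hallucinated(source: str, summary: str) -> bool:
    # Hypothetical stand-in for a human or model judge (e.g. an NLI-style
    # faithfulness check). Here: flag the summary if it introduces a token
    # absent from the source -- a crude illustration only.
    return not set(summary.lower().split()) <= set(source.lower().split())

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (source, summary) pairs whose summary is flagged."""
    flagged = sum(is_hallucinated(src, summ) for src, summ in pairs)
    return flagged / len(pairs)

if __name__ == "__main__":
    # Toy data standing in for ~1,000 public documents and their summaries.
    pairs = [
        ("The cat sat on the mat.", "The cat sat on the mat."),
        ("Revenue rose 5 percent in Q1.", "Revenue rose 15 percent in Q1."),
    ]
    print(f"hallucination rate: {hallucination_rate(pairs):.1%}")  # 50.0%
```

In practice the judge is the hard part: PersonQA checks factual claims about people, while the summarization leaderboard scores faithfulness to a source text, which is one reason the reported rates differ so widely.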

That's all to say: even industry-standard benchmarks make it difficult to assess hallucination rates.

Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But ChatGPT search requires OpenAI to share data with third-party search providers, and enterprise customers using OpenAI models internally might not be willing to expose their prompts that way.

Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more than its non-reasoning models, that might be a problem for its users. Mashable reached out to OpenAI and will update this story with a response.

