MarkTechPost@AI · June 6, 12:10
Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning

This article looks at how large language models (LLMs) can be trained to refuse to answer incomplete or misleading questions, thereby reducing hallucinations. The research finds that while reinforcement finetuning (RFT) makes model outputs more logical and structured, it also erodes the model's ability to decline unclear questions. To address this, researchers at the University of Southern California built the Synthetic Unanswerable Math (SUM) dataset, which introduces unanswerable problems into training to teach models to recognize when information is insufficient and respond appropriately. Experiments show that SUM markedly raises refusal rates on unanswerable questions without hurting accuracy on answerable ones, offering a new approach to making AI systems more reliable.

🤔 Reinforcement finetuning (RFT) improves LLM performance, but it can also lead models to "hallucinate" on unclear questions, confidently giving wrong answers.

💡 The researchers built the Synthetic Unanswerable Math (SUM) dataset by modifying existing problems to contain missing information or logical contradictions, creating questions that cannot be answered.

📚 SUM trains on a mix of answerable and unanswerable problems and instructs the model to answer "I don't know" to unanswerable inputs, teaching it to recognize when information is insufficient.

📈 Experiments show that adding a small amount of SUM data during RFT significantly raises refusal rates on unanswerable questions while preserving accuracy on answerable ones; for example, the Qwen2.5-7B model's refusal rate on the SUM benchmark rose from 0.01 to 0.73.

Reinforcement finetuning uses reward signals to guide a large language model toward desirable behavior. The method sharpens the model's ability to produce logical, structured outputs by reinforcing correct responses. Yet a challenge persists: ensuring that these models also know when not to respond, particularly when faced with incomplete or misleading questions that have no definite answer.

The problem arises when language models, after reinforcement finetuning, begin to lose their ability to refuse to answer unclear or ambiguous queries. Instead of signaling uncertainty, the models tend to produce confidently stated but incorrect responses. This phenomenon, identified in the paper as the “hallucination tax,” highlights a growing risk. As models are trained to perform better, they may also become more likely to hallucinate answers in situations where silence would be more appropriate. This is especially hazardous in domains that require high trust and precision.

Tools currently used in training large language models often overlook the importance of refusal behavior. Reinforcement finetuning frameworks tend to reward only correct answers while penalizing incorrect ones, ignoring cases where a valid response should be no answer at all. The reward systems in use do not sufficiently reinforce refusal, resulting in overconfident models. For instance, the paper shows that refusal rates dropped to near zero across multiple models after standard RFT, demonstrating that current training fails to address hallucination properly.
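To make the failure mode concrete, here is a minimal sketch of the binary, correctness-only reward that standard RFT pipelines typically use. This is an illustration of the general setup, not the paper's exact implementation:

```python
# Minimal sketch of a correctness-only RFT reward. Illustrative only:
# the paper's actual reward function is not reproduced here.

def correctness_only_reward(model_answer: str, gold_answer: str) -> float:
    """Return 1.0 for a match with the reference answer, else 0.0.

    A refusal such as "I don't know" can never match a gold answer,
    so it is always scored like a wrong answer.
    """
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0
```

Under this scheme, a confident wrong guess and a refusal earn the same zero reward, so there is no training pressure to preserve refusal behavior, which is consistent with the near-zero refusal rates the paper observes after standard RFT.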

Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by modifying existing questions, for example by removing key information or introducing logical inconsistencies. The researchers used DeepScaleR as the base dataset and employed the o3-mini model to generate high-quality unanswerable questions. The synthetic dataset aims to teach models to recognize when a problem lacks sufficient information and to respond accordingly.
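The paper does not reproduce its generation prompt here, so the template below is a hypothetical sketch of how such edits could be requested from o3-mini; the prompt wording and the criteria list are assumptions that mirror the criteria described above.

```python
# Hypothetical sketch of prompting a model (the paper uses o3-mini) to
# turn an answerable DeepScaleR problem into a SUM-style unanswerable
# one. The prompt wording and criteria list are assumptions.

EDIT_CRITERIA = [
    "remove a quantity that is necessary to solve the problem",
    "introduce a logical contradiction between two stated conditions",
]

def unanswerable_edit_prompt(problem: str, criterion: str) -> str:
    return (
        "Rewrite the following math problem so it becomes unanswerable "
        f"by applying this edit: {criterion}. Keep the wording natural "
        "and plausible, and change as little of the text as possible.\n\n"
        f"Problem: {problem}"
    )
```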

SUM's core technique is to mix answerable and unanswerable problems during training. Questions are modified to become ambiguous or unsolvable while remaining plausible. The training prompts instruct models to say "I don't know" for unanswerable inputs. With SUM making up only 10% of the reinforcement finetuning data, models begin to use inference-time reasoning to evaluate uncertainty. This teaches them to refuse answers more appropriately without impairing their performance on solvable problems.
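Wiring this together, the sketch below shows one way the 10% mixture and an "I don't know"-aware reward could be implemented. The helper names and the exact reward scheme are assumptions, not the paper's code:

```python
import random

REFUSAL = "i don't know"  # canonical refusal string from the training prompt

def build_rft_pool(answerable, unanswerable, sum_fraction=0.10, seed=0):
    """Mix SUM items into the RFT pool so they make up ~10% of it.

    Assumes the unanswerable set is large enough to sample from.
    """
    rng = random.Random(seed)
    n_sum = round(len(answerable) * sum_fraction / (1.0 - sum_fraction))
    pool = list(answerable) + rng.sample(list(unanswerable), n_sum)
    rng.shuffle(pool)
    return pool

def refusal_aware_reward(model_answer, gold_answer, is_answerable):
    """Reward refusal on unanswerable items, correctness on answerable ones."""
    refused = REFUSAL in model_answer.lower()
    if not is_answerable:
        return 1.0 if refused else 0.0
    return 1.0 if (not refused
                   and model_answer.strip() == gold_answer.strip()) else 0.0
```

The key change from the correctness-only reward is that refusal now earns positive reward exactly when the item is unanswerable, so the model is pushed to discriminate between the two cases rather than to guess everywhere.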

Performance analysis shows significant improvements. After training with SUM, the Qwen2.5-7B model increased its refusal rate from 0.01 to 0.73 on the SUM benchmark and from 0.01 to 0.81 on the UMWP benchmark. On the SelfAware dataset, refusal accuracy rose dramatically from 0.01 to 0.94. Llama-3.1-8B-Instruct showed a similar trend, with refusal rates improving from 0.00 to 0.75 on SUM and from 0.01 to 0.79 on UMWP. Despite these gains in refusal behavior, accuracy on answerable datasets, such as GSM8K and MATH-500, remained stable, with most changes ranging from 0.00 to -0.05. The minimal drop indicates that refusal training can be introduced without major sacrifices in task performance.
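The refusal-rate metric itself is straightforward to reproduce. The sketch below assumes simple case-insensitive substring matching against refusal phrases, which may differ from the paper's exact evaluation rule:

```python
# Assumed refusal markers; the paper's matching criterion may differ.
REFUSAL_MARKERS = ("i don't know",)

def refusal_rate(model_outputs):
    """Fraction of outputs containing a refusal marker (case-insensitive)."""
    refusals = sum(
        any(marker in output.lower() for marker in REFUSAL_MARKERS)
        for output in model_outputs
    )
    return refusals / len(model_outputs)

# A rate of 0.73 means 73% of unanswerable benchmark items were refused.
```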

This study outlines a clear trade-off between improved reasoning and trustworthiness. Reinforcement finetuning, while powerful, tends to suppress cautious behavior. The SUM dataset corrects this by teaching models to recognize what they cannot solve. With only a small addition to training data, language models become better at identifying the boundaries of their knowledge. This approach marks a significant step in making AI systems not just smarter but also more careful and honest.


Check out the Paper and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.


