MarkTechPost@AI · April 22, 04:55
LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses

Large language models (LLMs) are being used ever more widely in healthcare, but the reliability of their outputs is an increasingly visible problem. Researchers at Stanford University developed SourceCheckup, a tool for evaluating how accurately LLMs cite sources in the medical domain. The study found that even answers generated by top models such as GPT-4o are often not fully supported by the sources they cite, raising concerns about the reliability of LLMs in clinical decision-making. The researchers recommend that LLMs be trained or fine-tuned specifically for accurate citation and verification to improve their trustworthiness.

🩺 LLMs are increasingly used in healthcare, but their tendency to generate unverified or inaccurate statements poses serious risks in medical settings, and many clinicians distrust them as a result.

💡 The researchers built an automated tool called SourceCheckup to evaluate how accurately LLMs cite relevant sources in their medical responses. The tool analyzed 800 questions and more than 58,000 source-statement pairs.

🔎 The results show that 50%–90% of LLM-generated answers were not fully supported by the sources they cited; even LLMs with web access struggled to consistently provide source-backed responses.

✅ SourceCheckup's judgments closely matched those of medical expert validators, while open-source models performed poorly at producing valid citation URLs. GPT-4o fared better thanks to its internet access, yet only 55% of its responses were backed by reliable sources.

🛠️ To improve LLM reliability, the study recommends training or fine-tuning models specifically for accurate citation and verification, and using automated tools to improve citation reliability.

As LLMs become more prominent in healthcare settings, ensuring that credible sources back their outputs is increasingly important. Although no LLMs are yet FDA-approved for clinical decision-making, top models such as GPT-4o, Claude, and MedPaLM have outperformed clinicians on standardized exams like the USMLE. These models are already being utilized in real-world scenarios, including mental health support and the diagnosis of rare diseases. However, their tendency to hallucinate—generating unverified or inaccurate statements—poses a serious risk, especially in medical contexts where misinformation can lead to harm. This issue has become a major concern for clinicians, with many citing a lack of trust and the inability to verify LLM responses as key barriers to adoption. Regulators, such as the FDA, have also emphasized the importance of transparency and accountability, underscoring the need for reliable source attribution in medical AI tools.

Recent improvements, such as instruction fine-tuning and retrieval-augmented generation (RAG), have enabled LLMs to generate sources when prompted. Yet, even when references are from legitimate websites, there is often little clarity on whether those sources truly support the model’s claims. Prior research has introduced datasets such as WebGPT, ExpertQA, and HAGRID to assess LLM source attribution; however, these rely heavily on manual evaluation, which is time-consuming and difficult to scale. Newer approaches utilize LLMs themselves to assess attribution quality, as demonstrated in works such as ALCE, AttributedQA, and FactScore. While tools like ChatGPT can assist in evaluating citation accuracy, studies reveal that such models still struggle to ensure reliable attribution in their outputs, highlighting the need for continued development in this area.
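To make the LLM-as-judge idea concrete, here is a minimal sketch of checking whether one cited source supports one claim. The prompt wording and the `ask_llm` callable are illustrative assumptions, not the actual prompts or APIs used by SourceCheckup, ALCE, or FactScore.

```python
from typing import Callable

# Illustrative judge prompt; the real systems use their own carefully tuned prompts.
JUDGE_PROMPT = """You are verifying a medical claim against its cited source.
Claim: {claim}
Source text: {source}
Answer with exactly one word: SUPPORTED if the source text fully supports the
claim, otherwise UNSUPPORTED."""


def is_statement_supported(claim: str, source_text: str,
                           ask_llm: Callable[[str], str]) -> bool:
    """Ask a judge model whether the cited source fully supports the claim."""
    prompt = JUDGE_PROMPT.format(claim=claim, source=source_text)
    verdict = ask_llm(prompt).strip().upper()
    return verdict.startswith("SUPPORTED")
```

In practice, `ask_llm` would wrap whatever chat-completion endpoint is available, and the binary verdict could be replaced with a graded score if finer-grained attribution quality is needed.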

Researchers from Stanford University and other institutions have developed SourceCheckup, an automated tool designed to evaluate the accuracy with which LLMs support their medical responses with relevant sources. Analyzing 800 questions and over 58,000 source-statement pairs, they found that 50%–90% of LLM-generated answers were not fully supported by cited sources, with GPT-4 showing unsupported claims in about 30% of cases. Even LLMs with web access struggled to provide source-backed responses consistently. Validated by medical experts, SourceCheckup revealed significant gaps in the reliability of LLM-generated references, raising critical concerns about their readiness for use in clinical decision-making.

The study evaluated the source attribution performance of several top-performing and open-source LLMs using a custom pipeline called SourceCheckup. The process involved generating 800 medical questions—half from Reddit’s r/AskDocs and half created by GPT-4o using MayoClinic texts—then assessing each LLM’s responses for factual accuracy and citation quality. Responses were broken down into verifiable statements, matched with cited sources, and scored using GPT-4 for support. The framework reported metrics, including URL validity and support, at both the statement and response levels. Medical experts validated all components, and the results were cross-verified using Claude Sonnet 3.5 to assess potential bias from GPT-4.
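As a rough sketch of how the statement- and response-level metrics described above might be aggregated, the code below assumes each response has already been parsed into (statement, cited URL, fetched source text) records, and that a support check like the one sketched earlier is passed in; the data structures and metric names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SourcedStatement:
    statement: str
    cited_url: Optional[str]    # URL the model cited, if any
    source_text: Optional[str]  # fetched page text; None if the URL did not resolve


def score_response(statements: List[SourcedStatement],
                   supports: Callable[[str, str], bool]) -> dict:
    """Compute URL-validity and support rates for one parsed LLM response."""
    n = len(statements)
    valid_urls = sum(1 for s in statements if s.source_text is not None)
    supported = sum(
        1 for s in statements
        if s.source_text is not None and supports(s.statement, s.source_text)
    )
    return {
        # Fraction of statements whose cited URL resolved to retrievable text.
        "url_validity": valid_urls / n if n else 0.0,
        # Fraction of statements the judge marks as supported by their source.
        "statement_support": supported / n if n else 0.0,
        # Response-level support: every statement must be individually supported.
        "response_fully_supported": n > 0 and supported == n,
    }
```

Reporting both the per-statement fraction and the stricter all-statements-supported flag mirrors the two levels of granularity the study describes.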

The study presents a comprehensive evaluation of how well LLMs verify and cite medical sources, introducing a system called SourceCheckup. Human experts confirmed that the model-generated questions were relevant and answerable, and that parsed statements closely matched the original responses. In source verification, the model’s accuracy nearly matched that of expert doctors, with no statistically significant difference found between model and expert judgments. Claude Sonnet 3.5 and GPT-4o demonstrated comparable agreement with expert annotations, whereas open-source models such as Llama 2 and Meditron significantly underperformed, often failing to produce valid citation URLs. Even GPT-4o with RAG, though better than others due to its internet access, supported only 55% of its responses with reliable sources, with similar limitations observed across all models.

The findings underscore persistent challenges in ensuring factual accuracy in LLM responses to open-ended medical queries. Many models, even those enhanced with retrieval, failed to consistently link claims to credible evidence, particularly for questions from community platforms like Reddit, which tend to be more ambiguous. Human evaluations and SourceCheckup assessments consistently revealed low response-level support rates, highlighting a gap between current model capabilities and the standards needed in clinical contexts. To improve trustworthiness, the study suggests models should be trained or fine-tuned explicitly for accurate citation and verification. Additionally, automated tools like SourceCleanup demonstrated promise in editing unsupported statements to improve factual grounding, offering a scalable path to enhance citation reliability in LLM outputs.
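As a hedged illustration of the SourceCleanup idea, the sketch below rewrites a statement flagged as unsupported so that it asserts only what its cited source states, and keeps the rewrite only if the judge then accepts it; the prompt and control flow are assumptions rather than the authors' released code.

```python
from typing import Callable

# Illustrative rewrite prompt; not the prompt used by SourceCleanup.
REWRITE_PROMPT = """The following claim is not fully supported by its cited source.
Rewrite the claim so that it only asserts facts stated in the source text.
Claim: {claim}
Source text: {source}
Rewritten claim:"""


def cleanup_statement(claim: str, source_text: str,
                      supports: Callable[[str, str], bool],
                      ask_llm: Callable[[str], str]) -> str:
    """Rewrite an unsupported claim so it stays within its cited source."""
    if supports(claim, source_text):
        return claim  # already grounded; leave it alone
    revised = ask_llm(REWRITE_PROMPT.format(claim=claim, source=source_text)).strip()
    # Accept the rewrite only if the judge now considers it grounded;
    # otherwise fall back to the original so nothing is silently changed for the worse.
    return revised if supports(revised, source_text) else claim
```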


Check out the Paper.
