MarkTechPost@AI, December 9, 2024
Lavita AI Introduces Medical Benchmark for Advancing Long-Form Medical Question Answering with Open Models and Expert-Annotated Datasets

The Lavita AI team has released a public medical benchmark for evaluating long-form medical question-answering systems. The benchmark contains 1,298 real-world medical questions annotated by medical experts and assesses model performance along six dimensions: correctness, helpfulness, reasoning, harmfulness, efficiency, and bias. The study found that the open model Llama-3.1-405B-Instruct outperformed the commercial model GPT-4o, while the specialized clinical model Meditron3-70B showed no significant advantage over its base model, Llama-3.1-70B-Instruct. This suggests that well-trained open models are adequate for medical QA tasks, offering a viable open-source option for privacy-conscious, transparency-focused healthcare AI.

🔬 Lavita AI has released a public benchmark for comprehensively evaluating long-form medical question-answering systems, filling gaps left by existing benchmarks and advancing medical AI.

🩺 The benchmark contains 1,298 real-world medical questions annotated by medical professionals, spanning basic, intermediate, and advanced difficulty levels to ensure a diverse, high-quality dataset.

📊 Models are evaluated on six key criteria: correctness, helpfulness, reasoning, harmfulness, efficiency, and bias, giving a comprehensive view of real-world performance.

🏆 The study found that the open model Llama-3.1-405B-Instruct outperformed the commercial model GPT-4o on all metrics, challenging the notion that closed, domain-specific models are superior to open, general-purpose ones.

💡 The specialized clinical model Meditron3-70B showed no significant advantage over its general-purpose base model, Llama-3.1-70B-Instruct, raising questions about the added value of domain-specific tuning.

Medical question-answering (QA) systems are critical in modern healthcare, providing essential tools for medical practitioners and the public. Long-form QA systems differ significantly from simpler models by offering detailed explanations that reflect the complexity of real-world clinical scenarios. These systems must accurately interpret nuanced questions, often posed with incomplete or ambiguous information, and produce reliable, in-depth answers. With the increasing reliance on AI models for health-related inquiries, the demand for effective long-form QA systems is growing. Such systems improve healthcare accessibility and provide an avenue for refining AI's capabilities in decision-making and patient engagement.

Despite the potential of long-form QA systems, a major obstacle is the lack of benchmarks for evaluating how well LLMs generate long-form answers. Existing benchmarks are often limited to automatic scoring and multiple-choice formats, failing to reflect the intricacies of real-world clinical settings. Many are also closed-source and lack medical expert annotations. This lack of transparency and accessibility stifles progress toward robust QA systems that can handle complex medical inquiries. In addition, some existing datasets have been found to contain errors, outdated information, or overlap with training data, further compromising their utility for reliable assessment.

Various methods and tools have been employed to address these gaps, but each comes with limitations. Automatic evaluation metrics and curated datasets such as MedRedQA and HealthSearchQA provide baseline assessments but do not capture the broader context of long-form answers. The absence of diverse, high-quality datasets and well-defined evaluation frameworks has therefore held back the development of long-form QA systems.

A team of researchers from Lavita AI, Dartmouth Hitchcock Medical Center, and Dartmouth College introduced a publicly accessible benchmark designed to evaluate long-form medical QA systems comprehensively. The benchmark includes 1,298 real-world consumer medical questions annotated by medical professionals. It assesses both open and closed-source models against six performance criteria: correctness, helpfulness, reasoning, harmfulness, efficiency, and bias. The benchmark ensures a diverse, high-quality dataset by combining human expert annotations with advanced clustering techniques, and the researchers employed GPT-4 and other LLMs for semantic deduplication and question curation, yielding a robust resource for model evaluation.
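To make the six-criterion, pairwise setup concrete, the following minimal Python sketch shows one way an expert's comparison of two model answers could be recorded. The field names (`preferred_model`, `preferences`, `annotator_id`) and the `Criterion` enum are illustrative assumptions, not the benchmark's actual released format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict

class Criterion(str, Enum):
    """The six evaluation axes named in the benchmark."""
    CORRECTNESS = "correctness"
    HELPFULNESS = "helpfulness"
    REASONING = "reasoning"
    HARMFULNESS = "harmfulness"
    EFFICIENCY = "efficiency"
    BIAS = "bias"

@dataclass
class PairwiseJudgment:
    """One expert's comparison of two model answers to the same question.

    Illustrative schema only; the benchmark's annotation format may differ.
    """
    question_id: str
    model_a: str
    model_b: str
    # For each criterion, which answer the annotator preferred: "a", "b", or "tie".
    preferences: Dict[Criterion, str] = field(default_factory=dict)
    annotator_id: str = ""

# Example record: the expert prefers model A on correctness and reasoning,
# and calls helpfulness a tie.
judgment = PairwiseJudgment(
    question_id="q-0001",
    model_a="Llama-3.1-405B-Instruct",
    model_b="GPT-4o",
    preferences={
        Criterion.CORRECTNESS: "a",
        Criterion.REASONING: "a",
        Criterion.HELPFULNESS: "tie",
    },
    annotator_id="expert-07",
)
print(judgment.preferences[Criterion.CORRECTNESS])
```

Keeping per-criterion preferences rather than a single overall vote is what lets results be reported separately for correctness, efficiency, reasoning, and the other axes.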

Creating the benchmark involved a multi-phase approach. The researchers collected 4,271 user queries across 1,693 conversations from Lavita Medical AI Assist, then filtered and deduplicated them to produce 1,298 high-quality medical questions. Using semantic similarity analysis, they reduced redundancy and ensured the dataset covered a wide range of scenarios. Queries were categorized into three difficulty levels (basic, intermediate, and advanced) based on the complexity of the questions and the medical knowledge required to answer them. The researchers then created annotation batches of 100 questions each, with answers generated by various models for pairwise evaluation by human experts.
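As a rough illustration of the deduplication step, the sketch below greedily drops any question whose embedding is too similar to one already kept. The actual pipeline used GPT-4-assisted curation and clustering; the `all-MiniLM-L6-v2` sentence-embedding model and the 0.9 cosine threshold here are stand-in assumptions.

```python
# Greedy near-duplicate filtering over question embeddings.
# Stand-in for the paper's GPT-4-assisted deduplication; the model and
# threshold below are illustrative choices, not the authors' settings.
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(questions: list[str], threshold: float = 0.9) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Normalized embeddings so a dot product equals cosine similarity.
    embs = model.encode(questions, normalize_embeddings=True)

    kept_texts: list[str] = []
    kept_embs: list[np.ndarray] = []
    for text, emb in zip(questions, embs):
        if kept_embs and np.max(np.stack(kept_embs) @ emb) >= threshold:
            continue  # too close to a question we already kept
        kept_texts.append(text)
        kept_embs.append(emb)
    return kept_texts

if __name__ == "__main__":
    raw = [
        "What are the side effects of metformin?",
        "What side effects does metformin have?",
        "How is type 2 diabetes diagnosed?",
    ]
    print(deduplicate(raw))  # the paraphrased duplicate is removed
```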

The benchmark’s results revealed insights into the performance of different LLMs. Smaller-scale models like AlpaCare-13B outperformed others like BioMistral-7B in most criteria. Surprisingly, the state-of-the-art open model Llama-3.1-405B-Instruct outperformed the commercial GPT-4o across all metrics, including correctness, efficiency, and reasoning. These findings challenge the notion that closed, domain-specific models inherently outperform open, general-purpose models. Also, the results showed that Meditron3-70B, a specialized clinical model, did not significantly surpass its base model, Llama-3.1-70B-Instruct, raising questions about the added value of domain-specific tuning.

Some of the key takeaways from the research by Lavita AI:

- The benchmark comprises 1,298 expert-annotated, real-world consumer medical questions spanning basic, intermediate, and advanced difficulty levels.
- Models are compared pairwise by human experts on six criteria: correctness, helpfulness, reasoning, harmfulness, efficiency, and bias.
- The open Llama-3.1-405B-Instruct outperformed the commercial GPT-4o across all metrics.
- The specialized clinical model Meditron3-70B showed no significant advantage over its base model, Llama-3.1-70B-Instruct.
- Well-trained open models appear sufficient for long-form medical QA, supporting privacy-conscious, transparent healthcare AI.

In conclusion, this study addresses the lack of robust benchmarks for long-form medical QA by introducing a dataset of 1,298 real-world medical questions annotated by experts and evaluated across six performance metrics. Results highlight the superior performance of open models like Llama-3.1-405B-Instruct, which outperformed the commercial GPT-4o. Specialized models such as Meditron3-70B showed no significant improvements over general-purpose counterparts, suggesting the adequacy of well-trained open models for medical QA tasks. These findings underscore the viability of open-source solutions for privacy-conscious and transparent healthcare AI.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
