MarkTechPost@AI · posted the day before yesterday, 21:51
SDBench and MAI-DxO: Advancing Realistic, Cost-Aware Clinical Reasoning with AI

This article introduces SDBench, an interactive diagnostic benchmark built from New England Journal of Medicine cases and designed to simulate real clinical decision-making. The researchers also developed MAI-DxO, a system co-designed with physicians that simulates a virtual medical panel to select high-value, cost-effective tests. Experiments show that MAI-DxO significantly reduces diagnostic costs while improving diagnostic accuracy, outperforming both existing models and physicians. The work offers a new perspective on AI in healthcare and holds promise for future clinical practice and medical education.

🩺 Traditional AI medical evaluations rely on simplified, static scenarios that fail to capture the dynamic nature of clinical practice. Physicians refine their diagnostic approach step by step by asking questions and interpreting new information, whereas static evaluations often overlook problems such as premature decisions and over-testing.

💡 SDBench is an interactive benchmark built from 304 real diagnostic cases from the New England Journal of Medicine. It requires AI systems to act like physicians, actively asking questions and ordering tests before committing to a final diagnosis, bringing evaluation much closer to real clinical decision-making.

🤖 The researchers developed MAI-DxO, a system co-designed with physicians that simulates a virtual medical panel to select high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO reached accuracy as high as 85.5% while significantly reducing diagnostic costs.

💰 Experiments show that MAI-DxO consistently outperforms off-the-shelf models and physicians on SDBench. For example, it achieved 81.9% accuracy at $4,735 per case, versus 78.6% accuracy at $7,850 per case for the off-the-shelf o3 model.

🌍 Although SDBench and MAI-DxO show great promise, the current work has limitations, such as not covering everyday conditions or real-world constraints. Future work will test the system in real clinics and resource-limited settings, aiming for impact on global health and medical education.

AI has the potential to make expert medical reasoning more accessible, but current evaluations often fall short by relying on simplified, static scenarios. Real clinical practice is far more dynamic: physicians adjust their diagnostic approach step by step, asking targeted questions and interpreting new information as it arrives. This iterative process helps them refine hypotheses, weigh the costs and benefits of tests, and avoid jumping to conclusions. While language models have shown strong performance on structured exams, those tests don't reflect real-world complexity, where premature decisions and over-testing remain serious concerns often missed by static assessments.

Medical problem-solving has been explored for decades, with early AI systems utilizing Bayesian frameworks to guide sequential diagnoses in specialties such as pathology and trauma care. However, these approaches faced challenges due to the need for extensive expert input. Recent studies have shifted toward using language models for clinical reasoning, often evaluated through static, multiple-choice benchmarks that are now largely saturated. Projects like AMIE and NEJM-CPC introduced more complex case material but still relied on fixed vignettes. While some newer approaches assess conversational quality or basic information gathering, few capture the full complexity of real-time, cost-sensitive diagnostic decision-making. 

To better reflect real-world clinical reasoning, researchers from Microsoft AI developed SDBench, a benchmark based on 304 real diagnostic cases from the New England Journal of Medicine, where doctors or AI systems must interactively ask questions and order tests before making a final diagnosis. A language model acts as a gatekeeper, revealing information only when specifically requested. To improve performance, they introduced MAI-DxO, an orchestrator system co-designed with physicians that simulates a virtual medical panel to choose high-value, cost-effective tests. When paired with models like OpenAI’s o3, it achieved up to 85.5% accuracy while significantly reducing diagnostic costs. 
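The gatekeeper interaction described above can be sketched as a simple request-response loop. A minimal illustration follows; all names, findings, and prices here are hypothetical placeholders, not the authors' actual implementation or data.

```python
# Illustrative sketch of SDBench's interactive setup: a diagnostic agent
# queries a gatekeeper that reveals case findings only when specifically
# requested, while test orders accumulate cost. Names are hypothetical.

FINDINGS = {
    "history": "45-year-old with progressive dyspnea",   # placeholder finding
    "cbc": "normal white count, mild anemia",            # placeholder finding
}
TEST_COSTS = {"cbc": 30, "chest_ct": 800}  # placeholder prices in dollars

def gatekeeper(request: str) -> str:
    """Reveal a finding only on a specific request; otherwise decline."""
    return FINDINGS.get(request, "Not available; please request a specific finding.")

def run_episode(actions: list[tuple[str, str]]) -> tuple[str, int]:
    """Replay a scripted agent. Each action is ('ask'|'test'|'diagnose', payload).
    Returns the final diagnosis and the accumulated test cost."""
    cost = 0
    for kind, payload in actions:
        if kind == "diagnose":
            return payload, cost
        if kind == "test":
            cost += TEST_COSTS.get(payload, 0)
        _ = gatekeeper(payload)  # information is revealed step by step
    return "no diagnosis", cost

diagnosis, cost = run_episode([
    ("ask", "history"),
    ("test", "cbc"),
    ("diagnose", "iron-deficiency anemia"),
])
print(diagnosis, cost)  # iron-deficiency anemia 30
```

In the actual benchmark, both the gatekeeper and the judge are language models rather than dictionary lookups; the sketch only shows the shape of the interaction.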

The Sequential Diagnosis Benchmark (SDBench) was built using 304 NEJM Case Challenge scenarios (2017–2025), covering a wide range of clinical conditions. Each case was transformed into an interactive simulation where diagnostic agents could ask questions, request tests, or make a final diagnosis. A Gatekeeper, powered by a language model and guided by clinical rules, responded to these actions using realistic case details or synthetic but consistent findings. Diagnoses were evaluated by a Judge model using a physician-authored rubric focused on clinical relevance. Costs were estimated using CPT codes and pricing data to reflect real-world diagnostic constraints and decision-making. 
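The CPT-based cost accounting can be pictured as a simple lookup-and-sum over the tests ordered in an episode. The sketch below uses real CPT code numbers but invented prices; the study's actual pricing data is not reproduced here.

```python
# Illustrative cost accounting as described: each ordered test maps to a
# CPT code with a price, and per-case cost is the sum over ordered tests.
# Prices below are placeholders, not the study's actual pricing data.

CPT_PRICES = {
    "85025": 30.0,   # CBC with differential (illustrative price)
    "71260": 450.0,  # chest CT with contrast (illustrative price)
    "80053": 45.0,   # comprehensive metabolic panel (illustrative price)
}

def case_cost(ordered_cpt_codes: list[str]) -> float:
    """Sum the price of every test ordered during one diagnostic episode."""
    return sum(CPT_PRICES[code] for code in ordered_cpt_codes)

print(case_cost(["85025", "80053"]))  # 75.0
```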

The researchers evaluated various AI diagnostic agents on SDBench and found that MAI-DxO consistently outperformed both off-the-shelf models and physicians. While standard models showed a tradeoff between cost and accuracy, MAI-DxO, built on o3, delivered higher accuracy at lower cost through structured reasoning and decision-making. For instance, it reached 81.9% accuracy at $4,735 per case, compared to off-the-shelf o3's 78.6% at $7,850. It also proved robust across multiple models and held-out test data, indicating strong generalizability. The system significantly improved weaker models and helped stronger ones use resources more efficiently, reducing unnecessary tests through smarter information gathering.
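The comparison quoted above is a case of Pareto dominance on the cost-accuracy plane: one system is strictly preferable if it is at least as accurate and no more expensive. A quick check using the reported numbers:

```python
# Pareto-dominance check on (accuracy, cost-per-case) pairs, using the
# figures reported for MAI-DxO and off-the-shelf o3 on SDBench.

def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if system a is at least as accurate and no more expensive than b."""
    acc_a, cost_a = a
    acc_b, cost_b = b
    return acc_a >= acc_b and cost_a <= cost_b and a != b

mai_dxo = (0.819, 4735)  # 81.9% accuracy at $4,735 per case
o3 = (0.786, 7850)       # 78.6% accuracy at $7,850 per case
print(dominates(mai_dxo, o3))  # True
```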

In conclusion, SDBench is a new diagnostic benchmark that turns NEJM CPC cases into realistic, interactive challenges, requiring AI or doctors to actively ask questions, order tests, and make diagnoses, each with associated costs. Unlike static benchmarks, it mimics real clinical decision-making. The researchers also introduced MAI-DxO, a model that simulates diverse medical personas to achieve high diagnostic accuracy at a lower cost. While current results are promising, especially in complex cases, limitations include a lack of everyday conditions and real-world constraints. Future work aims to test the system in real clinics and low-resource settings, with potential for global health impact and medical education use. 

Check out the Technical Details. All credit for this research goes to the researchers of this project.

