Pioneering an AI clinical copilot with Penda Health

A study of nearly 40,000 patient visits found that clinicians using an AI assistance tool made significantly fewer diagnostic and treatment errors. The study was conducted at Penda Health clinics in Kenya, where the AI Consult system provides clinicians with real-time recommendations. Results showed a 16% relative reduction in diagnostic errors and a 13% reduction in treatment errors in the AI-assisted group. The study emphasizes that model capability, integration into clinical workflows, and active deployment and training are the key factors behind successful use of AI in healthcare, offering a strong practical example of how AI can improve the quality and efficiency of care worldwide.

💡 AI assistance measurably improves quality of care: A study covering nearly 40,000 patient visits shows that clinicians using the "AI Consult" tool made significantly fewer errors in both diagnosis and treatment. Specifically, compared with clinicians working without AI, those using it had a 16% lower diagnostic error rate and a 13% lower treatment error rate, indicating that AI plays a meaningful role in supporting clinical decisions and reducing human error.

🔧 Clinical integration and active deployment are key: The study stresses that an AI tool's success depends not only on model performance but on how it is woven into the clinical workflow and how clinicians are guided to use it. Penda Health co-developed "AI Consult" with its clinicians to ensure it supports rather than disrupts the flow of care. Through active deployment strategies built on connection, measurement, and incentives, Penda helped clinicians understand the tool's value and use it effectively, cutting the rate of unaddressed "red alerts" (safety-critical issues) from 40% to 20% and markedly increasing the tool's real-world impact.

🚀 AI helps clinicians learn and grow: Clinicians using the AI tool not only made fewer errors; their clinical judgment and breadth of knowledge also improved. Many described AI Consult as "a consultant in the room" that helped them reach the correct clinical judgment, and as a "learning tool" that broadened their medical knowledge and sharpened their clinical skills. The data match this perception: the number of AI alerts triggered by clinicians in the AI group declined over the course of the study, suggesting they learned from the tool's feedback and improved their own practice.

📊 Potential broad impact and outlook: Although the study found no statistically significant difference in direct patient health outcomes (such as the share of patients feeling better), AI Consult showed potential to prevent harm, and there were zero reported cases in which its recommendations led to harm. The study regards the clinical copilot as a promising early archetype for medical AI and anticipates that AI will become a standard part of healthcare systems worldwide, delivering better patient outcomes, though further research is needed to validate effects on long-term outcomes and to scale this deployment model.

July 22, 2025

Publication

Study of 40,000 patient visits finds clinicians using AI copilot made fewer errors.


AI systems have the potential to improve human health globally—to make reliable health information universally available, help clinicians deliver better care, and empower people to better understand and advocate for their health. 

Large language model (LLM) performance and safety in health continue to advance. OpenAI model performance on HealthBench doubled from GPT‑4o to o3, and frontier models often outperform experts on tasks like diagnostic reasoning and clinical summarization. Yet adoption of these models to solve real-world patient and clinician challenges remains slow. To realize the potential of LLMs in health, the ecosystem will need to close the model-implementation gap: the chasm between what models can do and how they are used in practice.

To advance research on real-world implementation, OpenAI partnered with Penda Health, a primary care provider operating in Nairobi, Kenya since 2012, to conduct a novel study of Penda’s LLM-powered clinician copilot. Penda built their copilot, AI Consult, to provide clinicians with LLM-written recommendations at key points during a patient visit. AI Consult acts as a real-time safety net that activates only when there might be an error, keeping clinicians fully in control.

In a study of 39,849 patient visits across 15 clinics, clinicians with AI Consult had a 16% relative reduction in diagnostic errors and a 13% reduction in treatment errors compared to those without. 

We believe this outcome was the result of three key factors:

  • Capable model: Penda’s copilot used the August 2024 version of GPT‑4o, and models have improved rapidly since. Model performance is no longer the limiting factor.
  • Clinically-aligned implementation: The copilot was co-developed with clinical users to ensure it genuinely supported—rather than disrupted—the flow of care.
  • Active deployment: Penda put considerable effort into helping clinicians understand why and how to use the copilot, which was crucial for uptake.

Today, we are publishing the study findings alongside a closer look at Penda’s successful implementation, offering the ecosystem an early template for the safe and effective use of LLMs to support clinicians.

We engaged extensively with local stakeholders for the study. This quality improvement research project was approved by the AMREF Health Africa Ethical and Scientific Review Committee, the Kenyan Ministry of Health, Digital Health Agency, and the Nairobi County Department of Health, and conducted under a research license from Kenya’s National Commission for Science, Technology and Innovation. 

Primary care, Penda Health, and AI Consult

AI systems could be especially useful in primary care. Primary care clinicians see patients across every age group, organ system, and disease type, often in the same day, requiring a vast breadth of knowledge. This complexity makes medical errors common: the WHO reports that patient harm in primary care is both common and preventable.

Penda Health is a social enterprise that aims to provide high-quality, affordable care. Penda has 16 clinics that each provide primary care, urgent care, laboratory services, and a pharmacy. These clinics are open 24/7 and receive nearly half a million patient visits each year. Penda maintains a uniquely strong focus on quality of care, with an active clinician training and quality program, and has developed and tested previous iterations of copilot systems.

After ChatGPT’s release, Penda’s Chief Medical Officer, Dr. Robert Korom, recognized how LLMs could enable higher-quality decision support by covering a broader range of conditions and potential errors than previously possible. In response, Penda built one of the earliest LLM clinical copilots, enabling clinicians to seek a second opinion from an LLM when desired. In an internal audit, Penda reviewed 100 LLM outputs from real patient encounters, and found many cases where LLM output was helpful and none where it was harmful. However, this early version of AI Consult achieved limited uptake because it required clinicians to actively request help and interrupted the flow of the patient interaction.

Iterating towards clinically-aligned implementation

In early 2025, Penda developed a new version of AI Consult that acts as a real-time safety net in a clinician’s workflow. This copilot is integrated into the electronic health record Penda clinicians use every day and runs in the background during every visit. As clinicians interact with patients and document patient visits, documentation without patient identifiers is sent to the OpenAI API at key points. AI Consult then provides any needed feedback to clinicians based on the clinical interaction so far. Three types of responses can be returned (a simplified sketch of the request flow follows the list):

  • Green: indicates no concerns; appears as a green checkmark.
  • Yellow: indicates moderate concerns; appears as a yellow ringing bell that clinicians can choose whether to view.
  • Red: indicates safety-critical issues; appears as a pop-up that clinicians are required to view before continuing.
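
To make this flow concrete, below is a minimal sketch of what one background safety-net call against the OpenAI API could look like. The prompt, function name, and JSON fields are illustrative assumptions, not Penda's implementation; the production prompts encode far more local clinical context, and such a call would run asynchronously at each documentation checkpoint so the clinician is never blocked.

```python
# Illustrative sketch only: names, prompt, and JSON schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a clinical safety net for a primary care clinic in Kenya. "
    "Review the de-identified visit documentation so far and respond in JSON "
    'with "severity" ("green", "yellow", or "red"), "reasoning", and '
    '"recommendations". Use "green" when there are no concerns, "yellow" for '
    'moderate concerns, and "red" only for safety-critical issues.'
)

def check_visit(documentation: str) -> dict:
    """Send de-identified documentation to the model; return a triaged response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # the model the study reports using
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": documentation},
        ],
    )
    return json.loads(response.choices[0].message.content)

result = check_visit("Dx: acute bacterial tonsillitis. FHG: HGB 9.90, MCV 58.30 ...")
if result["severity"] == "red":
    print("RED ALERT:", result["recommendations"])  # blocking pop-up in the EHR
```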

Penda designed AI Consult to ensure patient safety. The copilot acts as a safety net, identifying potential errors for a clinician to verify rather than taking actions on behalf of clinicians. Importantly, clinicians drive the workflow at every step: when the copilot identifies potential errors, clinicians can choose whether to modify their decisions based on the feedback, and the final decision belongs to the clinician. AI Consult was tailored to Penda’s context, with prompts including Kenyan epidemiological context, guidance on local clinical guidelines, and standard procedures at Penda’s clinics.

AI Consult flags an important missing diagnosis of iron deficiency anemia, leading the clinician to add this diagnosis so it can be appropriately treated.

Initial documentation

Investigations conducted:
Full Haemogram (FHG):
  • WBC: 12.38
  • HGB: 9.90
  • HCT: 30.70
  • Plt: 248.00
  • RBC: 5.26
  • MCV: 58.30
  • MCH: 18.80
  • MCHC: 32.20
Streptococcus A Antigen Test:
  • Result: Negative

Diagnosis: tonsilitis, acute bacterial

AI Consult response

Reasoning: The clinical documentation shows the presence of microcytic anemia with a hemoglobin level of 9.90 g/dL and an MCV of 58.30 fL, indicative of anemia that is not addressed in the diagnosis. The presence of microcytic anemia should be evaluated alongside the current symptom of vomiting.

Recommendations: Reevaluate the diagnosis considering the significant finding of microcytic anemia, which is not addressed. Investigate underlying causes of anemia, such as iron deficiency or chronic disease. Consider additional testing for iron studies or further dietary investigation. Ensure anemia is included in the management plan to address this critical finding.

Final documentation (changes made after AI Consult)

Investigations conducted: unchanged from above.

Diagnoses: tonsilitis, acute bacterial; Emesis, unspecified; Iron Deficiency Anemia; Group A Strep pharyngitis score <2

Active deployment for effective clinician uptake

Penda deployed AI Consult to a randomly selected half of its clinicians as part of its quality improvement practice. This deployment occurred in two phases: the induction period (January 30–February 28) and the main period (March 1–April 17).

During the induction period, Penda used clinician feedback to improve the copilot. This included addressing technical issues that couldn’t be identified in testing (e.g., occasional inconsistent triggering) and workflow mismatches (e.g., triggering for a missing blood pressure on a child’s visit, even though Penda doesn’t routinely take the blood pressure of children). In this period, Penda also noticed that clinicians were still early in learning to use AI Consult; for example, they often ignored red alerts because they weren’t aware of these alerts’ importance. This underscored how important it would be to help clinicians use the copilot well.

During the main period, Penda took several steps to help clinicians use AI Consult better. This included: 

  • Connection: Peer champions and branch managers explained why the copilot mattered, walked colleagues through its strengths and limitations, and offered one-on-one coaching to support uptake. 
  • Measurement: Penda tracked how often clinicians interacted with AI Consult recommendations and reached out with personalized coaching.
  • Incentives: Penda quality leadership recognized clinicians and clinics that used AI Consult well.

Penda collaborated with OpenAI to analyze the impacts of the copilot’s deployment, comparing care delivered by clinicians who did and did not have access to AI Consult. OpenAI provided financial support for the study and consulted on further development of the copilot. 

Only patients who individually consented to data being included in quality improvement research were included in the analysis, and they were able to withdraw their data on request. The AMREF Ethical and Scientific Review Committee determined that additional consent was not needed for this study.

Study results: AI Consult substantially reduced diagnostic and treatment errors

The study analyzed data from 39,849 patient visits: 20,859 in the group with AI Consult (the AI group) and 18,990 in the group without (the non-AI group). 

Effect on quality of care

108 independent physicians (29 from Kenya) rated the final documentation and decisions from 5,666 randomly selected visits to identify errors. They rated four dimensions: the quality of the history; how appropriate the investigations ordered were; whether the diagnosis was correct; and whether the treatment was correct.

Errors in all four categories were significantly lower in the AI group than in the non-AI group. History-taking errors were reduced by 32%, investigations errors by 10%, diagnostic errors by 16%, and treatment errors by 13%. This effect was even larger in cases where AI Consult would have returned at least one red alert: in these visits, AI reduced diagnostic errors by 31% and treatment errors by 18%.
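
Note that these reductions are relative, not absolute. A toy calculation of the arithmetic (the rates below are placeholders, not the study's measured error rates):

```python
# Relative reduction = (control rate - treatment rate) / control rate.
def relative_reduction(control_rate: float, treatment_rate: float) -> float:
    return (control_rate - treatment_rate) / control_rate

# Placeholder example: a drop from a 10.0% to an 8.4% diagnostic error rate
# is a 16% relative reduction.
print(relative_reduction(0.100, 0.084))  # ~0.16
```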

[Chart: error rates by visit stage for the AI and non-AI groups. Significance levels are denoted by stars: ★ for p ≤ 0.05, ★★ for p ≤ 0.01, and ★★★ for p ≤ 0.001.]

These effect sizes are comparable to antibiotic stewardship programs or alerts to encourage statin prescriptions in patients who need them, yet come from a single system that can support a wide range of clinical decisions. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone.
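
These annual figures follow from simple arithmetic against Penda's visit volume. A back-of-envelope check (the implied absolute rate differences are inferred here, not quoted from the study):

```python
# Errors averted per year = annual visits x absolute drop in error rate.
# Working backwards from the figures above implies drops of roughly 4.4 and
# 5.8 percentage points; these are inferences, not reported study values.
annual_visits = 500_000  # "nearly half a million patient visits each year"

diagnostic_drop = 22_000 / annual_visits  # ~0.044 (4.4 percentage points)
treatment_drop = 29_000 / annual_visits   # ~0.058 (5.8 percentage points)
print(f"{diagnostic_drop:.3f}, {treatment_drop:.3f}")
```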

We also examined the specific types of errors reduced by AI Consult. Clinicians in the AI group were less likely to miss key details in the history, miss key investigations, or get the main diagnosis wrong. They were also less likely to give the wrong medications or omit important patient education.

[Chart: rates of specific error types in the AI and non-AI groups. Significance levels are denoted by stars: ★ for p ≤ 0.05, ★★ for p ≤ 0.01, and ★★★ for p ≤ 0.001.]

Effect of active deployment

Penda’s active deployment work was strikingly effective. One of the measures that Penda tracked was the “left in red” rate: the percentage of visits that had red alerts in any category (or, for the non-AI group, would have had red alerts) and where clinicians did not remedy them.

During the induction period, the left in red rate was similar between the AI and non-AI groups at 35-40%, suggesting that clinicians with AI were only sometimes acting on red alerts. Once Penda began active deployment, the left in red rate in the AI group dropped to 20% while the non-AI group rate stayed near 40%, emphasizing how important active deployment was to AI Consult’s impact.
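
As a concrete reading of this metric, here is a minimal sketch; the record fields are assumptions for illustration, not Penda's actual schema:

```python
# "Left in red" rate: among visits with >= 1 red alert (actual, or
# counterfactual for the non-AI group), the share left unremedied.
from dataclasses import dataclass

@dataclass
class Visit:
    had_red_alert: bool        # visit had (or would have had) a red alert
    red_alert_remedied: bool   # clinician fixed the flagged issue

def left_in_red_rate(visits: list[Visit]) -> float:
    red = [v for v in visits if v.had_red_alert]
    if not red:
        return 0.0
    return sum(not v.red_alert_remedied for v in red) / len(red)

# Example: 2 of 3 red-alert visits left unremedied -> ~0.667
visits = [Visit(True, False), Visit(True, True), Visit(True, False), Visit(False, False)]
print(f"{left_in_red_rate(visits):.3f}")
```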

We surveyed Penda clinicians (anonymous, consented) about AI Consult’s impact on their care quality. All respondents in the AI group reported that AI Consult helped them improve the quality of care they could deliver, with 75% saying the effect was “substantial”.

Clinicians in the AI group didn’t just use AI Consult—they grew with it. One clinician noted that “It has helped me in multiple occasions to make the correct clinical judgement,” while others called it “a consultant in the room” and referred to it as “one of the best innovations to happen at Penda.” They also described it as a “learning tool” that could help them broaden their medical knowledge and sharpen their clinical skills. Study data matched this perception: clinicians with AI triggered fewer red alerts over time (from 45% of visits at the start of the study to 35% at the end), meaning they learned to avoid common pitfalls even before AI Consult feedback. Alongside the enthusiasm, clinicians also noted room for improvement, especially around localization and speed.

Effect on patient outcomes

As part of standard practice at Penda, staff call patients who consent eight days after their visit to ask whether or not they are feeling better. In the AI group, 3.8% of patients were not feeling better, while in the non-AI group, 4.3% of patients were not feeling better. This difference was not statistically significant. The rate of patients seeking additional care outside Penda—another quality signal Penda collects—was also similar between groups.
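
For readers wondering what "not statistically significant" means operationally here, below is a sketch of a two-sample proportion test in the spirit of this comparison. The counts are hypothetical; the study's real analysis used the subset of patients actually reached at follow-up and a more careful statistical model, so this toy example illustrates the mechanics only.

```python
# Two-sample z-test for a difference in proportions (illustrative counts).
from statsmodels.stats.proportion import proportions_ztest

not_better = [38, 43]        # hypothetical: patients not feeling better (AI, non-AI)
followed_up = [1000, 1000]   # hypothetical follow-up sample sizes
stat, pvalue = proportions_ztest(not_better, followed_up)
print(f"z = {stat:.2f}, p = {pvalue:.2f}")  # here p > 0.05: no significant difference
```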

Penda’s staff can also raise patient safety reports in cases of potential harm. There were 7 reports in the AI group and 5 in the non-AI group, each of which was studied by Penda’s team. In no case did AI Consult recommendations lead to harm. In several cases, AI Consult advice could potentially have prevented harm if available or heeded.

Where we go from here

Our work with Penda was driven by a shared commitment to expanding access to safe, high-quality care. Across the world, patients often have limited access to care or experience preventable harm. We conducted this research not only as a technical exercise, but as an effort to understand how AI can practically and responsibly help clinicians care for people.

Complementing this blog post is a full research paper on the study, AI Consult, and Penda’s deployment. We hope that this work provides inspiration and practical guidance for other healthcare organizations to advance the frontier of health AI use cases.

We believe that AI Consult represents an early, promising archetype of a clinical copilot, rather than its final form. We expect the healthcare ecosystem to drive further improvements in implementation, e.g., a voice-first interface to reduce documentation burden, or an agent that takes actions in a health record once a clinician confirms. Follow-up studies are needed to examine how these copilots affect patient outcomes, validate these implementations, and distill them into actionable templates for successful, scaled deployment. Penda is now running a randomized controlled trial with PATH to further measure effects on patient outcomes.

As AI models advance, the primary challenge ahead is no longer model capability but real-world implementation. Closing the model-implementation gap will require coordinated effort across the health AI ecosystem, including rigorous evaluation and iterative deployment in clinical settings. We believe AI-based copilots are beginning to enter the Overton window for responsible adoption. As we continue to study and scale these systems, our hope is that AI can become a trusted part of the standard of care—as it is already becoming at Penda Health—to deliver better patient outcomes globally.
