Study of 40,000 patient visits finds clinicians using AI copilot made fewer errors.
AI systems have the potential to improve human health globally—to make reliable health information universally available, help clinicians deliver better care, and empower people to better understand and advocate for their health.
Large language model (LLM) performance and safety in health continue to advance. OpenAI model performance on HealthBench doubled from GPT‑4o to o3, and frontier models often outperform experts on tasks like diagnostic reasoning and clinical summarization. Yet adoption to solve real-world patient and clinician challenges remains slow. To realize the potential of LLMs in health, the ecosystem will need to close the model-implementation gap—the chasm between what models can do and how they are used in practice.
To advance research on real-world implementation, OpenAI partnered with Penda Health, a primary care provider operating in Nairobi, Kenya since 2012, to conduct a novel study of Penda’s LLM-powered clinician copilot. Penda built their copilot, AI Consult, to provide clinicians with LLM-written recommendations at key points during a patient visit. AI Consult acts as a real-time safety net that activates only when there might be an error, keeping clinicians fully in control.
In a study of 39,849 patient visits across 15 clinics, clinicians with AI Consult had a 16% relative reduction in diagnostic errors and a 13% reduction in treatment errors compared to those without.
We believe this outcome was the result of three key factors:
- Capable model: Penda’s copilot used GPT‑4o from August 2024, and models have improved rapidly since. Model performance is no longer the limiting factor.
- Clinically-aligned implementation: The copilot was co-developed with clinical users to ensure it genuinely supported—rather than disrupted—the flow of care.
- Active deployment: Penda put considerable effort into helping clinicians understand why and how to use the copilot, which was crucial for uptake.
Today, we are publishing the study findings alongside a closer look at Penda’s successful implementation, offering the ecosystem an early template for the safe and effective use of LLMs to support clinicians.
We engaged extensively with local stakeholders for the study. This quality improvement research project was approved by the AMREF Health Africa Ethical and Scientific Review Committee, the Kenyan Ministry of Health, the Digital Health Agency, and the Nairobi County Department of Health, and conducted under a research license from Kenya’s National Commission for Science, Technology and Innovation.
AI systems could be especially useful in primary care. Primary care clinicians see patients across every age group, organ system, and disease type, often in the same day, requiring a vast breadth of knowledge. This complexity makes medical errors common: the WHO reports that patient harm in primary care is both common and preventable.
Penda Health is a social enterprise that aims to provide high-quality, affordable care. Penda has 16 clinics that each provide primary care, urgent care, laboratory services, and a pharmacy. These clinics are open 24/7 and receive nearly half a million patient visits each year. Penda maintains a uniquely strong focus on quality of care, with an active clinician training and quality program, and has developed and tested previous iterations of copilot systems.

After ChatGPT’s release, Penda’s Chief Medical Officer, Dr. Robert Korom, recognized how LLMs could enable higher-quality decision support by covering a broader range of conditions and potential errors than previously possible. In response, Penda built one of the earliest LLM clinical copilots, enabling clinicians to seek a second opinion from an LLM when desired. In an internal audit, Penda reviewed 100 LLM outputs from real patient encounters, and found many cases where LLM output was helpful and none where it was harmful. However, this early version of AI Consult achieved limited uptake because it required clinicians to actively request help and interrupted the flow of the patient interaction.

In early 2025, Penda developed a new version of AI Consult that acts as a real-time safety net in a clinician’s workflow. This copilot is integrated into the electronic health record Penda clinicians use every day and runs in the background during every visit. As clinicians interact with patients and document patient visits, documentation without patient identifiers is sent to the OpenAI API at key points. AI Consult then provides any needed feedback to clinicians based on the clinical interaction so far. There are three types of responses that can be returned (a sketch of this call flow follows the list below):
- Green: indicates no concerns; appears as a green checkmark.
- Yellow: indicates moderate concerns; appears as a yellow ringing bell that clinicians can choose whether to view.
- Red: indicates safety-critical issues; appears as a pop-up that clinicians are required to view before continuing.
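To make this flow concrete, here is a minimal sketch of what one such call might look like using the OpenAI Python SDK. The prompt, JSON format, and function name are illustrative assumptions rather than Penda’s implementation; the post states only that de-identified documentation is sent to the OpenAI API at key points and that feedback is mapped to green, yellow, or red.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical system prompt. Penda's real prompts also encode Kenyan epidemiology,
# local clinical guidelines, and Penda's standard procedures.
SYSTEM_PROMPT = (
    "You are a clinical safety net for a primary care clinician in Nairobi, Kenya. "
    "Review the de-identified visit documentation so far and reply with JSON of the form "
    '{"severity": "green" | "yellow" | "red", "feedback": "concise guidance, empty if green"}. '
    "Reserve red for safety-critical issues."
)

def review_visit(documentation: str) -> dict:
    """Send de-identified visit notes to the model and map the reply to an alert level."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": documentation},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# Called at key points in a visit, e.g. after diagnoses or treatments are entered.
alert = review_visit("Adult with fatigue and pallor; Hb 8.2 g/dL; plan: multivitamins only.")
if alert["severity"] == "red":
    print("Blocking pop-up:", alert["feedback"])   # must be viewed before continuing
elif alert["severity"] == "yellow":
    print("Optional bell:", alert["feedback"])     # clinician may choose to view
```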
Penda designed AI Consult to ensure patient safety. The copilot acts as a safety net, identifying potential errors for a clinician to verify rather than taking actions on behalf of clinicians. Importantly, clinicians drive the workflow at every step: when the copilot identifies potential errors, clinicians can choose whether to modify their decisions based on the feedback, and the final decision belongs to the clinician. AI Consult was tailored to Penda’s context, with prompts including Kenyan epidemiological context, guidance on local clinical guidelines, and standard procedures at Penda’s clinics.
AI Consult flags an important missing diagnosis of iron deficiency anemia, leading the clinician to add this diagnosis so it can be appropriately treated.
Penda deployed AI Consult to a randomly selected half of its clinicians as part of its quality improvement practice. This deployment occurred in two phases—the induction period (January 30–February 28) and the main period (March 1–April 17).
During the induction period, Penda used clinician feedback to improve the copilot. This included addressing technical issues that couldn’t be identified in testing (e.g., inconsistent triggering) and workflow mismatches (e.g., flagging a missing blood pressure on a child’s visit, even though Penda doesn’t routinely take the blood pressure of children). In this period, Penda also noticed that clinicians were still learning to use AI Consult—for example, they often ignored red alerts because they didn’t yet appreciate how critical those alerts were—which underscored the need to help clinicians use the copilot well.
During the main period, Penda took several steps to help clinicians use AI Consult more effectively. These included:
- Connection: Peer champions and branch managers explained why the copilot mattered, walked colleagues through its strengths and limitations, and offered one-on-one coaching to support uptake.
- Measurement: Penda tracked how often clinicians interacted with AI Consult recommendations and reached out with personalized coaching.
- Incentives: Penda quality leadership recognized clinicians and clinics that used AI Consult well.
Penda collaborated with OpenAI to analyze the impacts of the copilot’s deployment, comparing care delivered by clinicians who did and did not have access to AI Consult. OpenAI provided financial support for the study and consulted on further development of the copilot.
Only patients who individually consented to their data being included in quality improvement research were included in the analysis, and they could withdraw their data on request. The AMREF Ethical and Scientific Review Committee determined that additional consent was not needed for this study.
The study analyzed data from 39,849 patient visits: 20,859 in the group with AI Consult (the AI group) and 18,990 in the group without (the non-AI group).
A panel of 108 independent physicians (29 from Kenya) rated the final documentation and decisions from 5,666 randomly selected visits to identify errors. They rated four dimensions: the quality of the history, the appropriateness of the investigations ordered, whether the diagnosis was correct, and whether the treatment was correct.
Errors in all four categories were significantly lower in the AI group than in the non-AI group. History-taking errors were reduced by 32%, investigations errors by 10%, diagnostic errors by 16%, and treatment errors by 13%. This effect was even larger in cases where AI Consult would have returned at least one red alert: in these visits, AI reduced diagnostic errors by 31% and treatment errors by 18%.
[Figure: error rates by category for the AI and non-AI groups. Significance levels are denoted by stars: ★ for p ≤ 0.05, ★★ for p ≤ 0.01, and ★★★ for p ≤ 0.001.]
These effect sizes are comparable to antibiotic stewardship programs or alerts to encourage statin prescriptions in patients who need them, yet they come from a single system that can support a wide range of clinical decisions. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone.
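For intuition on how a relative reduction translates into these absolute figures, averted errors scale with annual visit volume and the baseline error rate. A rough sketch follows; the baseline rate below is an assumed placeholder for illustration, not a figure reported in the study.

```python
def averted_errors(annual_visits: int, baseline_error_rate: float, relative_reduction: float) -> int:
    """Expected errors averted per year if the relative reduction held across all visits."""
    return round(annual_visits * baseline_error_rate * relative_reduction)

# Illustrative only: ~450,000 visits/year ("nearly half a million"), an assumed 30%
# baseline diagnostic error rate, and the observed 16% relative reduction.
print(averted_errors(450_000, 0.30, 0.16))  # -> 21600, on the order of the ~22,000 cited
```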
We also examined the specific types of errors reduced by AI Consult. Clinicians in the AI group were less likely to miss key details in the history, miss key investigations, or get the main diagnosis wrong. They were also less likely to give the wrong medications or to omit important patient education.
[Figure: rates of specific error types for the AI and non-AI groups. Significance levels are denoted by stars: ★ for p ≤ 0.05, ★★ for p ≤ 0.01, and ★★★ for p ≤ 0.001.]
Penda’s active deployment work was strikingly effective. One of the measures Penda tracked was the “left in red” rate: the percentage of visits that had red alerts in any category (or would have had red alerts, for the non-AI group) and where clinicians did not remedy them.
During the induction period, the “left in red” rate was similar between the AI and non-AI groups, at 35–40%, suggesting that clinicians with AI were only sometimes acting on red alerts. Once Penda began active deployment, the AI group’s rate dropped to 20% while the non-AI group’s stayed near 40%, underscoring how important active deployment was to AI Consult’s impact.
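Read as a formula, the metric is a simple ratio over visits, as in the sketch below; the record fields are hypothetical and not Penda’s actual schema.

```python
def left_in_red_rate(visits: list[dict]) -> float:
    """Share of visits with at least one red alert (actual, or counterfactual for the
    non-AI group) in which the flagged issues were not remedied by the end of the visit."""
    red = [v for v in visits if v["had_red_alert"]]                # hypothetical field
    unresolved = [v for v in red if not v["red_alerts_remedied"]]  # hypothetical field
    return len(unresolved) / len(red) if red else 0.0
```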
We surveyed Penda clinicians (anonymous, consented) about AI Consult’s impact on their care quality. All respondents in the AI group reported that AI Consult helped them improve the quality of care they could deliver, with 75% saying the effect was “substantial”.
Clinicians in the AI group didn’t just use AI Consult—they grew with it. One clinician noted that “It has helped me in multiple occasions to make the correct clinical judgement,” while others called it “a consultant in the room” and “one of the best innovations to happen at Penda.” They also described it as a “learning tool” that could help them broaden their medical knowledge and sharpen their clinical skills. Study data matched this perception: clinicians with AI triggered fewer red alerts over time (from 45% of visits at the start of the study to 35% at the end), suggesting they learned to avoid common pitfalls even before receiving AI Consult feedback. Alongside the enthusiasm, clinicians also noted room for improvement, especially around localization and speed.
As part of standard practice at Penda, staff call consenting patients eight days after their visit to ask whether they are feeling better. In the AI group, 3.8% of patients reported not feeling better, compared with 4.3% in the non-AI group; this difference was not statistically significant. The rate of patients seeking additional care outside Penda—another quality signal Penda collects—was also similar between groups.
Penda’s staff can also raise patient safety reports in cases of potential harm. There were 7 reports in the AI group and 5 in the non-AI group, each of which was reviewed by Penda’s team. In no case did AI Consult recommendations lead to harm. In several cases, AI Consult advice could have prevented harm had it been available or heeded.
Our work with Penda was driven by a shared commitment to expanding access to safe, high-quality care. Across the world, patients often have limited access to care or experience preventable harm. We conducted this research not only as a technical exercise, but as an effort to understand how AI can practically and responsibly help clinicians care for people.
Complementing this blog post is a full research paper on the study, AI Consult, and Penda’s deployment. We hope that this work provides inspiration and practical guidance for other healthcare organizations to advance the frontier of health AI use cases.
We believe that AI Consult represents an early, promising archetype of a clinical copilot, rather than its final form. We expect the healthcare ecosystem to drive further improvements in implementation, e.g., a voice-first interface to reduce documentation burden, or an agent that takes actions in a health record once a clinician confirms. Follow-up studies are needed to measure how these copilots affect patient outcomes, validate these implementations, and distill them into actionable templates for successful, scaled deployment. Penda is now running a randomized controlled trial with PATH to further measure effects on patient outcomes.
As AI models advance, the primary challenge ahead is no longer model capability but real-world implementation. Closing the model-implementation gap will require coordinated effort across the health AI ecosystem, including rigorous evaluation and iterative deployment in clinical settings. We believe AI-based copilots are beginning to enter the Overton window for responsible adoption. As we continue to study and scale these systems, our hope is that AI can become a trusted part of the standard of care—as it is already becoming at Penda Health—to deliver better patient outcomes globally.