Five AI Models Collaborate To Ace US Medical Licensing Exams

A council of five artificial intelligence systems working together scored higher on the United States Medical Licensing Examination than any single chatbot tested to date, according to a new study published October 9, 2025, in PLOS Digital Health.

The team, composed of multiple GPT-4 instances guided by a facilitator algorithm, reached 97 percent, 93 percent, and 94 percent accuracy on questions aligned with USMLE Steps 1, 2 CK, and 3, respectively. The result suggests that careful deliberation among independent AI agents can outperform a fast, single answer from one model.

The researchers assembled five copies of the same model and asked each to answer the same multiple-choice questions. When the agents disagreed, a facilitator summarized their reasoning and prompted a new round of discussion until the group reached consensus. On a set of 325 publicly available USMLE questions, the council consistently surpassed single-model performance and also outperformed simple majority voting among the five agents.
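The deliberation loop described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the study's actual system: `query_model` is a hypothetical stand-in for a real language-model API call, and the round limit and summary format are assumptions.

```python
import random
from collections import Counter

def query_model(question: str, context: str = "") -> tuple[str, str]:
    """Hypothetical stand-in for a call to a language model.
    Returns (answer_choice, reasoning)."""
    choice = random.choice("ABCD")
    return choice, f"Chose {choice} because ..."

def council_answer(question: str, n_agents: int = 5, max_rounds: int = 3) -> str:
    """Ask n_agents the same question; if they disagree, feed a
    facilitator summary of all rationales back and re-ask, repeating
    until the group is unanimous or the round limit is hit."""
    context = ""
    top = ""
    for _ in range(max_rounds):
        responses = [query_model(question, context) for _ in range(n_agents)]
        counts = Counter(answer for answer, _ in responses)
        top, votes = counts.most_common(1)[0]
        if votes == n_agents:  # unanimous: consensus reached
            return top
        # Facilitator step: summarize the competing rationales so the
        # next round of agents must engage with each other's reasoning.
        context = "\n".join(
            f"Agent {i}: answered {a} -- {r}"
            for i, (a, r) in enumerate(responses)
        )
    return top  # no consensus: fall back to the final round's majority
```

The key design point, as the article describes it, is that agents exchange explanations rather than bare votes, so each new round forces them to justify or revise their choice.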

From disagreement to consensus, and to higher accuracy

The most striking gains appeared when the models initially disagreed. Roughly one in five questions required discussion. In those cases, the council corrected over half of the initial majority errors, and when there was not unanimous agreement at the outset, the group still converged on the right answer 83 percent of the time. The facilitator-driven process was designed to reward explanations, not just picks, which may have reduced superficial pattern matching and forced each agent to justify its choice.

Lead author Yahya Shaikh argues that collaboration, not uniformity, is the point. He frames variability across models as a useful signal that can be mined for better answers rather than a flaw to be suppressed.

“Instead, embracing variability through teamwork might unlock new possibilities for AI in medicine and beyond.”

The study also reports a practical boundary. The council never arrived at a correct consensus when all members began with incorrect answers. In other words, the process can amplify signal when at least one agent is right, but it did not conjure correctness from thin air. That caveat mirrors ensemble behavior in other domains, where adding diverse but competent voters usually helps, and adding uniformly weak ones does not.
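That ensemble caveat can be made concrete with a toy simulation (this is a generic Condorcet-style illustration, not the study's method): when independent voters are individually better than chance, a majority of five beats any one of them, but when they are individually worse than chance, the majority amplifies the error.

```python
import random

def majority_correct(p_correct: float, n_voters: int,
                     trials: int = 10_000, seed: int = 0) -> float:
    """Estimate how often a simple majority of independent voters,
    each correct with probability p_correct, picks the right answer."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        right_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if right_votes > n_voters / 2:
            wins += 1
    return wins / trials
```

With `p_correct = 0.7`, five voters agree on the right answer noticeably more often than 70 percent of the time; with `p_correct = 0.3`, the majority is right less often than any single voter. Deliberation can redistribute signal among competent agents, but it cannot create signal where none exists.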

Impressive scores, important limitations

Despite the headline results, the authors are careful about scope. The benchmark consisted of publicly available, text only USMLE items. No images or tables were included. The work did not test clinical decision making with patient data, and the system has not faced real world time pressure. The deliberations can take several rounds, which adds computational cost and latency that may be unacceptable at the bedside. The authors suggest parallelization could mitigate delays, but that is a future engineering task, not a present guarantee.

There are also questions about generality. This experiment used multiple instances of a single model family. Would a mixed council, drawing on models trained with different data and alignment strategies, perform better, worse, or simply argue longer? The paper hints that diversity across models could add value, but it stops short of testing that claim. Likewise, the council’s gains over single-shot GPT-4 were measured on a specific question set. Replication on other medical benchmarks would clarify whether the method generalizes.

Still, for educators and developers, the practical message is clear. If you can afford extra tokens and seconds, asking several independent models to show their reasoning, then prompting them to reconcile differences, can yield more reliable answers than trusting a single confident response. That is not a grand theory of intelligence. It is an operational trick with measurable benefits on a tough exam.

Co author Zishan Siddiqui underscores that goal, pushing back on hype about raw test prowess in favor of process improvements.

“Instead, we describe a method that improves accuracy by treating AI’s natural response variability as a strength.”

The study lands at a moment when medical AI is shifting from novelty demonstrations to questions of trust, transparency, and accountability. A deliberating council creates transcripts that humans can audit, highlighting where reasoning converged and where it struggled. That record could be as valuable as the final answer, especially in classrooms or quality improvement settings. For clinical use, however, regulators and practitioners will rightly demand prospective trials that test not only accuracy but also safety, cost, and equity.

For now, the council’s success offers a practical reminder. When knowledge is complex and uncertainty is high, second opinions help, even if they come from machines. The work shows that AI does not need to be singular to be strong. It may be stronger, and more trustworthy, when it argues with itself first.

PLOS Digital Health: 10.1371/journal.pdig.0000787

