ChatGPT Passes Clinical Exams at a Level Comparable to a Fellow, but Faces Limitations

Release date: 2024-05-22 Source: https://www.tctmd.com/news/chatgpt-can-pass-clinical-exams-fellow-some-caveats Author: L.A. McKeown

As artificial intelligence (AI) becomes more integrated into medicine, researchers are also uncovering its current limitations and finding that it still falls short of human capabilities. A recent study indicates that while chatbots can pass certification exams with clinician-level accuracy, their performance can be inconsistent.

Emmanouil S. Brilakis, MD, PhD (Minneapolis Heart Institute, MN), the senior author of a study that tested ChatGPT 4.0 (OpenAI) against interventional cardiology fellows, remarked, "It's encouraging, but I don't think we can trust this quite yet." The study aimed to evaluate whether ChatGPT could pass a simulation test for the American College of Cardiology (ACC)/American Board of Internal Medicine (ABIM) Collaborative Maintenance Pathway (CMP).

Since its launch in late 2022, ChatGPT, a large language model capable of generating human-like text responses, has sparked debate in the academic medical community. This has led some journal editors, such as those from the JACC family of journals and JAMA, to establish policies on the use of AI-based tools in scientific publications. Recently, a small study showed that ChatGPT was fairly proficient at responding to simple cardiovascular disease prevention questions.

Inspired by a survey of cardiologists that highlighted support for AI-enabled tools to enhance care quality and efficiency, Brilakis and his team tested ChatGPT 4.0’s ability to take clinical exams. In a research letter published in JACC: Cardiovascular Interventions, the team reported that ChatGPT scored 76.7% on the 60-question multiple-choice exam, enough to pass, compared with an average score of 82.2% for the fellows. The exam also required explanations for incorrect answers.

However, when retested two weeks later, ChatGPT’s score dropped to 65%. Interestingly, it correctly answered three questions it had previously gotten wrong, but ten questions it had answered correctly on the initial test were answered incorrectly on the retest (the two scores correspond to 46 of 60 questions initially and 46 - 10 + 3 = 39 of 60 on the retest). Brilakis told TCTMD that this inconsistency is both problematic and unexpected. "AI is a black box in some ways. I thought it would improve on the retest, as humans do," he said. "I don’t know if it’s the algorithm or what, but this is a major concern."

Additionally, ChatGPT struggled with questions involving videos, which it cannot view: on a version of the exam that retained the video-based questions, it scored only 61.7%. When those questions were converted to a text-based multiple-choice format, ChatGPT answered all but one correctly. It also performed well on image-based questions, correctly answering five of six.

Brilakis concluded that ChatGPT’s inconsistent performance and poor test-retest reliability indicate that while it may help researchers or clinicians phrase questions more effectively, it is not yet a reliable tool for clinical decision-making. "It’s better than a search engine for complex questions and faster, but its answers need to be confirmed," he said. "Accuracy is still an issue."