‘Overly cautious’ ChatGPT inaccurate at prescribing correct treatment
Key takeaways:
- LLMs performed poorly overall across three clinical recommendations compared with ED physicians.
- The models were overly cautious, a tendency that is not always appropriate in ED settings, a researcher said.
ChatGPT may be more likely than physicians to recommend unnecessary treatment, such as X-rays or antibiotics, during ED visits, results of a study published in Nature Communications suggest.
These inaccuracies could ultimately lead to higher costs and potentially harm patients, and further development of large language models (LLMs) for these types of care settings is needed, according to one of the study’s co-authors.
“Since ChatGPT came out in 2022, there has been a lot of interest in how these models can be used in health care,” Christopher Y. K. Williams, MB, BChir, a postdoctoral scholar at the University of California, San Francisco, told Healio. “Before any new technology is deployed in clinical practice, it is important to evaluate what are its benefits and — just as importantly — limitations.”
ChatGPT models have shown potential for several uses in clinical practice, such as supporting clinical decision-making and summarizing medical abstracts.
“A few months ago, we published a study showing that these models are comparable to physicians at the task of determining clinical acuity for triage in the ED,” Williams explained. “With this study, we wanted to go a step further and evaluate their performance for providing clinical recommendations: admission, imaging needed and antibiotics.”
In the current analysis, the researchers examined the accuracy of ChatGPT-3.5-turbo’s and ChatGPT-4-turbo’s clinical recommendations for patients in the ED compared with physicians.
For each clinical decision, the researchers compiled a set of 10,000 ED visits drawn from a sample of 251,401 visits.
According to a press release, Williams and colleagues entered the provider’s notes on patient symptoms and examination findings into the LLMs and tested the accuracy of each model’s recommendations for each set.
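The paper’s actual pipeline is not reproduced here, but a minimal sketch of this kind of evaluation, assuming the OpenAI Python client (the prompt wording, function names and yes/no parsing are hypothetical), might look like this:

```python
# Minimal sketch of an LLM-vs-physician evaluation like the one described
# above; NOT the authors' code. Assumes the OpenAI Python client and an
# OPENAI_API_KEY in the environment; prompt wording is hypothetical.
from openai import OpenAI

client = OpenAI()

def recommend_admission(note_text: str, model: str = "gpt-4-turbo") -> bool:
    """Ask the model for a yes/no admission recommendation for one ED note."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers make the evaluation repeatable
        messages=[
            {"role": "system",
             "content": "You are an emergency physician. Answer only 'yes' or 'no'."},
            {"role": "user",
             "content": "Based on this ED note, should the patient be admitted?\n\n"
                        + note_text},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

def accuracy(visits: list[tuple[str, bool]], model: str) -> float:
    """Fraction of visits where the model matches the physician's decision.

    Each visit is (note_text, physician_admitted).
    """
    hits = sum(recommend_admission(note, model) == admitted
               for note, admitted in visits)
    return hits / len(visits)
```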
They found that the overall performance of the AI was lower than that of humans across all three tasks: ChatGPT-3.5-turbo was 24% less accurate than physicians, whereas ChatGPT-4-turbo was 8% less accurate.
The researchers noted the results may reflect the complexity of clinical decision-making, where recommendations may be impacted by several factors such as patient preference, the current availability of resources and social determinants of health.
“We were surprised to find that these LLMs were overly cautious in their clinical recommendations,” Williams said. “Both GPT-3.5-turbo and GPT-4-turbo exhibited a tendency to recommend intervention, which led to a significant number of false positive suggestions.”
He added that this tendency “may be problematic given the need to both prioritize hospital resource availability and reduce overall health care costs.”
“It is an important phenomenon to be aware of — the overarching message is not to blindly trust the suggestions of these models,” he said.
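In an evaluation like the hypothetical sketch above, that over-caution would surface as false positives: visits where the model recommends an intervention the physician did not order. A minimal tally, under the same assumptions:

```python
# Continuing the hypothetical evaluation: among visits where the physician
# did NOT intervene, count how often the model recommended intervening anyway.
def false_positive_rate(model_says_yes: list[bool],
                        physician_said_yes: list[bool]) -> float:
    """False positives / all physician-negative visits."""
    verdicts = [model for model, doc in zip(model_says_yes, physician_said_yes)
                if not doc]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```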
The performance of ChatGPT-4-turbo varied across the recommendations. For example, it showed 25% lower accuracy for patient admission, 4% lower accuracy for radiological investigations and 5% greater accuracy for antibiotic prescriptions vs. physicians.
The study had limitations. The researchers noted that not all of the information that informed the real-life recommendation may have been available in the presenting history and physical examination sections of the ED physician note.
Williams noted a couple of areas that future research could examine.
For example, “how do we train these models to do a better job at this particular task, providing clinical recommendations?” he said. “Is there a way to better incorporate clinical guidelines into the decision-making process to improve their accuracy?”
References:
- When it comes to emergency care, ChatGPT overprescribes. Available at: https://www.ucsf.edu/news/2024/10/428591/when-it-comes-emergency-care-chatgpt-overprescribes. Published Oct. 8, 2024. Accessed Oct. 15, 2024.
- Williams C, et al. Nat Commun. 2024;doi:10.1038/s41467-024-52415-1.