October 15, 2024
2 min read

‘Overly cautious’ ChatGPT inaccurate at prescribing correct treatment


Key takeaways:

  • LLMs performed poorly overall across three clinical recommendations compared with ED physicians.
  • The models were overly cautious, a tendency that is not always appropriate in ED settings, a researcher said.

ChatGPT may be more likely than physicians to recommend unnecessary treatment, such as X-rays or antibiotics, during ED visits, study results published in Nature Communications suggest.

These inaccuracies could ultimately lead to higher costs and potentially harm patients, and further development of large language models (LLMs) in these types of care settings is needed, according to one of the study’s co-authors.

Data derived from: Williams C, et al. Nat Commun. 2024;doi:10.1038/s41467-024-52415-1.

“Since ChatGPT came out in 2022, there has been a lot of interest in how these models can be used in health care,” Christopher Y. K. Williams, MB, BChir, a postdoctoral scholar at the University of California, San Francisco, told Healio. “Before any new technology is deployed in clinical practice, it is important to evaluate what are its benefits and — just as importantly — limitations.”

ChatGPT models have shown potential for several uses in clinical practice, such as supporting clinical decision-making and summarizing medical abstracts.

“A few months ago, we published a study showing that these models are comparable to physicians at the task of determining clinical acuity for triage in the ED,” Williams explained. “With this study, we wanted to go a step further and evaluate their performance for providing clinical recommendations: admission, imaging needed and antibiotics.”

In the current analysis, the researchers examined the accuracy of ChatGPT-3.5-turbo’s and ChatGPT-4-turbo’s clinical recommendations for patients in the ED compared with physicians.

For each clinical decision, the researchers compiled a set of 10,000 ED visits drawn from a sample of 251,401 visits.

According to a press release, Williams and colleagues entered the provider’s notes on patient symptoms and examination findings into the LLMs, testing the accuracy of each set.
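For readers curious what such an evaluation might look like in practice, here is a minimal sketch, assuming a simple yes/no admission prompt and the public OpenAI chat API. The prompt wording, data fields and scoring below are illustrative assumptions, not the study’s actual pipeline.

```python
# Minimal sketch of an LLM-vs-physician evaluation loop (illustrative only;
# the prompt, fields and scoring are assumptions, not the study's code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Based on the following ED presenting history and physical exam, "
    "should this patient be admitted to the hospital? Answer Yes or No.\n\n{note}"
)

def llm_recommends_admission(note: str, model: str = "gpt-4-turbo") -> bool:
    """Ask the model for a yes/no admission recommendation."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def accuracy(visits: list[dict]) -> float:
    """Fraction of visits where the model matches the physician's decision.

    Each visit dict is assumed to hold a 'note' string and a boolean
    'admitted' flag recording the real-world outcome.
    """
    correct = sum(
        llm_recommends_admission(v["note"]) == v["admitted"] for v in visits
    )
    return correct / len(visits)
```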

They found that the overall performance of the AI was lower than that of humans across all three tasks: ChatGPT-3.5-turbo was 24% less accurate than physicians, whereas ChatGPT-4-turbo was 8% less accurate.

The researchers noted the results may reflect the complexity of clinical decision-making, where recommendations may be impacted by several factors such as patient preference, the current availability of resources and social determinants of health.

“We were surprised to find that these LLMs were overly cautious in their clinical recommendations,” Williams said. “Both GPT-3.5-turbo and GPT-4-turbo exhibited a tendency to recommend intervention, which led to a significant number of false positive suggestions.”

He added that this tendency “may be problematic given the need to both prioritize hospital resource availability and reduce overall health care costs.”

“It is an important phenomenon to be aware of — the overarching message is not to blindly trust the suggestions of these models,” he said.
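To make “overly cautious” concrete: a model that leans toward recommending intervention tends to show high sensitivity alongside a high false positive rate. The tally below uses invented confusion counts, not study data, purely to illustrate that pattern.

```python
# Hypothetical confusion counts for one recommendation task.
# These numbers are made up for illustration; they are not study data.
tp, fp, tn, fn = 600, 300, 80, 20

false_positive_rate = fp / (fp + tn)  # unnecessary interventions suggested
sensitivity = tp / (tp + fn)          # needed interventions correctly flagged

print(f"FPR = {false_positive_rate:.2f}, sensitivity = {sensitivity:.2f}")
# FPR = 0.79, sensitivity = 0.97: the model rarely misses a needed
# intervention but recommends many unnecessary ones -- the "overly
# cautious" pattern the authors describe.
```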

The performance of ChatGPT-4-turbo varied across the recommendations. For example, it showed 25% lower accuracy for patient admission, 4% lower accuracy for radiological investigations and 5% greater accuracy for antibiotic prescriptions vs. physicians.

There were some study limitations. The researchers noted it was possible that not all the information that led to the real-life recommendation was available in the presenting history and physical examination sections of the ED physician note.

Williams noted that there are a couple of different areas that future research could examine.

For example, “how do we train these models to do a better job at this particular task, providing clinical recommendations?” he said. “Is there a way to better incorporate clinical guidelines into the decision-making process to improve their accuracy?”

References:

Williams C, et al. Nat Commun. 2024;doi:10.1038/s41467-024-52415-1.