
February 14, 2024

Study: Chatbots may provide misinformation in clinical management of orthopedic conditions


Key takeaways:

  • Chatbots showed limitations, deviating from the standard of care and omitting critical steps in workups.
  • The chatbots oversampled a small number of references and provided faulty links.

SAN FRANCISCO — Large language model chatbots may provide misinformation and inaccurate musculoskeletal health information to patients, according to data presented at the American Academy of Orthopaedic Surgeons Annual Meeting.

Branden Rafael Sosa, a fourth-year medical student at Weill Cornell Medicine, and colleagues analyzed the validity and accuracy of information on orthopedic procedures that large language model chatbots provided to patients. They also assessed how the chatbots explained basic orthopedic concepts, integrated clinical information into decision-making and addressed patient queries.

Data were derived from Sosa BR, et al. Poster e067. Presented at: American Academy of Orthopaedic Surgeons Annual Meeting; Feb. 12-16, 2024; San Francisco.

In the study, Sosa and colleagues prompted OpenAI ChatGPT 4.0, Google Bard and BingAI chatbots to each answer 45 orthopedic-related questions in the categories of “bone physiology,” “referring physician” and “patient query.” Two independent, masked reviewers scored responses on a scale of zero to four, assessing accuracy, completeness and usability.


Researchers analyzed the responses for strengths and limitations within categories and among the chatbots. They found that when prompted with orthopedic questions, OpenAI ChatGPT, Google Bard and BingAI provided correct answers that covered the most salient points in 77%, 33% and 17% of queries, respectively. When providing clinical management suggestions, all chatbots displayed significant limitations, deviating from the standard of care and omitting critical steps in workup, such as ordering antibiotics before cultures or neglecting to include key studies in the diagnostic workup.

“I think clinical context is one of the things that they struggled with most, particularly when coming up with an assessment or a plan for a patient who presents with infection. Oftentimes, they forgot to get cultures before initiating antibiotics, forgot to order radiographs in the workup of a patient with hip osteoarthritis, or forgot to point to seminal papers that highlight changes in the way that treatment is delivered,” Sosa told Healio.

When asked less complex patient queries, the researchers found ChatGPT and Google Bard provided mostly accurate responses but often failed to elicit critical medical history needed to fully address the query. Additionally, careful analysis of the citations provided by the chatbots revealed oversampling of a small number of references and 10 faulty links that were nonfunctional or led to incorrect articles.

“I would say that in certain applications, AI chatbots, in particular ChatGPT, performed pretty well. It was able to give clinically useful information in [the] majority of cases, broadly speaking. But that generally good performance carries with it some significant risks as well,” Matthew B. Greenblatt, MD, PhD, an associate professor of pathology and laboratory medicine at Weill Cornell Medicine and co-author of the study, told Healio.

Greenblatt said there was a subset of patient questions or clinician questions for which the information provided was not accurate enough to be used clinically.

“It's not in [the] majority of cases, but it was enough that it would be significantly problematic for real-world deployment if this was really used as a main line tool,” Greenblatt said.

Greenblatt said results of this study highlight the importance of oversight by subject matter experts when using large language model chatbots in clinical contexts.

“It could potentially be a timesaver or helpful in summarizing information. When all of that is overseen and checked by someone who is truly an expert, one can be well aware of where the chatbot led astray,” Greenblatt said.