Issue: January 2025
Fact checked by Erik Swain

November 07, 2024

GPT-4 responses to anti-obesity medication questions comparable to FDA answers

Key takeaways:

  • Physicians had difficulty distinguishing GPT-4 responses from FDA responses about obesity medications.
  • The physicians thought GPT-4 responses were more accurate but FDA responses were better communicated.

SAN ANTONIO — When asked common questions regarding anti-obesity medications, GPT-4 responses were easy to interpret, accurate and comparable to responses provided by the FDA, according to findings presented at ObesityWeek.

“Large language models refer to a specific type of machine learning, or more often, a deep learning algorithm that’s trained on massive amounts of text, often all of the published literature that's been available in history, and it uses this to then create a predictive model that can mimic human speech,” Thomas W. Fredrick, MD, gastroenterology fellow in the division of gastroenterology and hepatology at Mayo Clinic Minnesota, said during the presentation. “The model we used in this study, ChatGPT, uses an algorithm GPT-4 and it receives up to 600 million visits per month, so you can be sure that your patients, and basically almost everyone you work with, is using and interacting with these on a daily basis.”

Physicians had difficulty distinguishing GPT-4 responses from FDA responses about obesity medications. Image: Adobe Stock.

Fredrick and colleagues conducted a comparative analysis to assess GPT-4-generated responses to common anti-obesity medication questions. In October 2023, researchers prompted the GPT-4 large language model with questions patients commonly ask about FDA-approved anti-obesity medications, including liraglutide (Saxenda, Novo Nordisk), semaglutide (Wegovy, Novo Nordisk), phentermine/topiramate ER (Qsymia, Vivus) and bupropion/naltrexone SR.

“These large language models are becoming increasingly prevalent in society, and we expect that our patients will be turning to large language models with questions about their medications rather than FDA-provided information,” Fredrick said during the presentation. “Our aim was to investigate how we as board-certified medical providers evaluate these large language model-composed responses and compare them to the FDA-generated responses regarding questions that our patients commonly ask about anti-obesity medications.”

Prompts covered mechanism of action, serious adverse events, missed drug doses and expected weight loss related to anti-obesity medications. Researchers compared these responses with those found in FDA patient information sheets. Ten random responses from the FDA and 20 from GPT-4 were graded by 10 blinded physicians on accuracy, whether they thought a response was AI-generated and the response’s usability for patient communication.

Overall, the physicians correctly identified AI-generated responses in 46% of cases and FDA responses in 54% of cases, Fredrick said.

“We can only accurately tell this pretty much as well as a coin flip,” he said. “We as providers are not the best at discerning the FDA language compared to the GPT-4 language.”

The physicians rated GPT-4 responses as more accurate than FDA responses (Likert scale scores, 3.18 vs. 3.87; P < .001). A subanalysis did not find that any particular medications or questions were outliers, Fredrick said.

More than half (55%) of physician reviewers preferred FDA responses for patient communication vs. 39% who preferred GPT-4 responses, but the difference was not statistically significant, Fredrick said.

“Neither of these are particularly great if you are trying to draft a patient communication,” he said. “For that, you would want something 80% to 100% of the time.” There were no outliers in terms of specific medications or questions, he said.

“It appears GPT-4 responses can generate information that appears, when providers assess it, to be accurate,” Fredrick said. “We recommend validating all AI technologies. In general, best practices for patient communication is that [responses] should ideally be at the sixth grade reading level. Brief is better, and make sure that it's accurate.”