December 18, 2024

GPT-4’s rheumatology answers more accurate vs other LLMs, still have ‘possibility of harm’

Fact checked by Shenaz Bagha

WASHINGTON — Despite answering 78% of rheumatology questions correctly, besting two other popular large language models, GPT-4 still introduced the “possibility of harm” in up to 20% of its responses, according to data presented at ACR Convergence 2024.

“In the medical field, large language models (LLMs) have many applications, including research, programming, science communication, et cetera; as well as in patient care, in services such as translation and reviewing patient charts; and in education, among many other functions,” Jaime Flores-Gouyonnet, MD, of the Mayo Clinic, told attendees at ACR Convergence 2024. “However, LLMs have been widely reported to ‘hallucinate,’ creating content that can be false or misleading. Therefore, we aimed to analyze not only the accuracy of large language models, but also the quality and the safety of their answers.”

“It’s very important that patients and practitioners that are not experts in the area take very cautiously the answers that LLMs provide,” Jaime Flores-Gouyonnet, MD, said. Image: Adobe Stock

To assess the ability of large language models to recall accurate rheumatology information, Flores-Gouyonnet and colleagues submitted 40 multiple-choice questions, 10 of which contained images, from the American College of Rheumatology’s CARE-2022 Question Bank to three large language models: GPT-4, Claude 3: Opus and Gemini Advanced.

Five board-certified rheumatologists, blinded to which model provided which answer, evaluated the quality and safety of the answers across seven domains: scientific consensus, evidence of comprehension, evidence of retrieval, evidence of reasoning, inappropriate/incorrect content, missing content and possibility of harm.

The researchers used a five-point Likert scale to assess the first six domains, while “possibility of harm” was graded as mild, moderate or severe whenever a model provided an incorrect answer.
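The study did not include analysis code; purely as an illustration of the tallying described above, the hypothetical Python sketch below computes per-model accuracy and possibility-of-harm rates from records of this shape. The field names and example entries are placeholders, not data or code from the study.

```python
# Hypothetical sketch only: tally percent-correct and possibility-of-harm
# rates per model, mirroring the study design described above.
from collections import defaultdict

def summarize(records):
    """records: dicts with keys 'model', 'correct' (bool) and
    'harm' (None, 'mild', 'moderate' or 'severe')."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "harm": 0, "severe": 0})
    for r in records:
        t = totals[r["model"]]
        t["n"] += 1
        t["correct"] += r["correct"]
        if r["harm"] is not None:  # harm graded only for incorrect answers
            t["harm"] += 1
            t["severe"] += r["harm"] == "severe"
    for model, t in totals.items():
        print(f"{model}: {t['correct']/t['n']:.0%} correct, "
              f"{t['harm']/t['n']:.0%} with possibility of harm "
              f"({t['severe']} rated severe)")

# Example call with two made-up entries standing in for the 40 questions:
summarize([
    {"model": "GPT-4", "correct": True, "harm": None},
    {"model": "GPT-4", "correct": False, "harm": "mild"},
])
```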

According to the researchers, all three models answered most of the 40 questions correctly. GPT-4 performed best, answering 78% of questions correctly, followed by Claude 3: Opus with 63%, and Gemini Advanced finishing last with 53%.

In addition, GPT-4 and Claude 3: Opus both answered eight out of 10 image questions correctly, while Gemini Advanced answered only three correctly.

Incorrect answers across all three models had similar rates of harm possibility, with 71% for GPT-4, 73% for Claude 3: Opus and 75% for Gemini Advanced. However, incorrect answers from Gemini Advanced had the greatest rate of “severe harm” possibility, at 52%.

Overall, 11 answers from Claude 3 (27.5%), six from Gemini Advanced (15%) and five from GPT-4 (12.5%) were rated as having harm potential.

“It’s very important to mention this — the proportion of answers having positive potential of harm was higher for Claude 3 than for the other two, but the difference was not statistically significant,” Flores-Gouyonnet said. “They more or less performed the same.”

Further data revealed that GPT-4 bested Gemini Advanced in all seven evaluated domains and was superior to Claude 3: Opus in terms of scientific consensus (P < .001), evidence of reasoning (P < .001), inappropriate/incorrect content (P = .007) and missing content (P = .011). Gemini Advanced was also inferior to Claude 3: Opus in terms of scientific consensus (P = .01), evidence of comprehension (P < .001), evidence of retrieval (P < .001), evidence of reasoning (P < .001) and missing content (P < .001).

“We can conclude here that ChatGPT outperformed the other two LLMs,” Flores-Gouyonnet said. “It was more accurate and had better quality answers while keeping the same safety, in general.

“It’s very worth noticing that even the best performing model had answers rated as having positive possibility of harm in up to 20% of times, and 10% of those answers were rated as having severe possibility of harm,” he added. “Consequently, it’s very important that patients and practitioners that are not experts in the area take very cautiously the answers that LLMs provide, because if they are not experts, they can take decisions that will lead to severe harm.”