ChatGPT shows potential for accurately summarizing medical abstracts, researchers find
Key takeaways:
- On a scale of 0 to 100, ChatGPT's summaries of medical abstracts scored a median of 90 for quality.
- Summaries could be a useful tool for busy clinicians to decide whether to review literature further.
ChatGPT produced high-quality and accurate summaries of medical abstracts but struggled to classify the relevance of abstracts to medical specialties, a study published in the Annals of Family Medicine showed.
“Care models emphasizing clinical productivity leave clinicians with scant time to review the academic literature, even within their own specialty,” Joel Hake, MD, an assistant professor of family medicine and community health at the University of Kansas Medical Center, and colleagues wrote. “Recent developments in artificial intelligence and natural language processing might offer new tools to confront this problem.”
The researchers suggested that large language models (LLMs) like ChatGPT may be able to efficiently help physicians review literature by creating summaries of medical abstracts, “focusing on points that were most likely to be salient for practicing physicians.”
In the study, Hake and colleagues compiled 140 peer-reviewed abstracts across 14 journals and rated ChatGPT 3.5’s summaries of them based on quality, accuracy and the amount of bias.
Each characteristic was graded on a scale of 0 to 100 by seven physicians.
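The paper does not reproduce the researchers' exact prompts, but the summarization step they describe can be sketched with OpenAI's chat completions API. In the minimal sketch below, the prompt wording and the word limit are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch of the summarization workflow described in the study.
# The system prompt and word cap are illustrative assumptions,
# not the authors' actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_abstract(abstract_text: str) -> str:
    """Ask a GPT-3.5 model for a short, clinician-oriented summary."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You summarize medical abstracts for busy clinicians. "
                    "Keep the summary brief and preserve the study's main "
                    "findings and interpretation."
                ),
            },
            {"role": "user", "content": abstract_text},
        ],
    )
    return response.choices[0].message.content
```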
The researchers also examined ChatGPT 3.5’s ability to classify the relevance of journals and abstracts to specific medical specialties like neurology, cardiology and gynecology.
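The relevance-classification step can be sketched in the same way. The yes/no framing and the specialty list below are again assumptions made for illustration, not the prompts used in the study.

```python
# Illustrative sketch of the relevance-classification step; the yes/no
# framing and specialty examples are assumptions, not the study's prompt.
from openai import OpenAI

client = OpenAI()

SPECIALTIES = ["neurology", "cardiology", "gynecology"]  # examples named in the article


def classify_relevance(abstract_text: str, specialty: str) -> str:
    """Ask whether an abstract is relevant to a given medical specialty."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Is the following abstract relevant to {specialty}? "
                    "Answer 'yes' or 'no'.\n\n" + abstract_text
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip().lower()
```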
Overall, ChatGPT 3.5’s summaries were 70% shorter than the original abstracts, reducing the mean character count to 739.
The summaries were rated as having:
- high quality (median score = 90; interquartile range [IQR], 87-92.5);
- high accuracy (median score = 92.5; IQR, 89-95); and
- low bias (median score = 0; IQR, 0-7.5).
Serious inaccuracies and hallucinations, defined by Hake and colleagues as changes to the major interpretation of the abstract, occurred in only four of the summaries.
Minor inaccuracies were found in 20 summaries; these involved either ambiguity in meaning or the summarization of details that would have provided additional content but did not change the meaning.
Hake and colleagues added that ChatGPT 3.5 was able to classify the relevance of journals to various specialties, but it was less able to classify the relevance of specific abstracts to specialties.
“We hope that in future iterations of LLMs, these tools will become more capable of relevance classification,” they wrote.
The research had some limitations: the abstracts came from a limited number of journals, focused only on clinical medicine, and were restricted to primary research reports, systematic reviews and meta-analyses.
Based on the findings, the summaries “are likely to be useful as a screening tool to help busy clinicians and scientists more rapidly evaluate whether further review of an article is likely to be worthwhile,” Hake and colleagues wrote.