ChatGPT correctly responds to most patient education questions on thyroid nodules
Key takeaways:
- ChatGPT responded correctly to 69.2% of patient education questions about thyroid nodules.
- On average, ChatGPT's responses were written at a college reading level.
ChatGPT correctly answered more than two-thirds of submitted questions about thyroid nodules, though its responses may be at a grade level too high for patient education, according to study findings published in Thyroid.
“Whether the authors think that ChatGPT could be reliably used by patients or not, we find it inevitable that it will be,” Daniel J. Campbell, MD, an otolaryngologist in the department of otolaryngology – head and neck surgery at Thomas Jefferson University Hospitals in Philadelphia, told Healio | Endocrine Today. “ChatGPT is the fastest-growing online application to date, and artificial intelligence’s integration into everyday life is becoming stronger by the day. This is why we feel this research is so important. It is paramount that clinicians understand the current capabilities and limitations of the software.”
Campbell and colleagues queried the April 2, 2023, version of ChatGPT four times with an identical set of 30 sequential questions pertaining to patient education on thyroid nodules. The queries included five questions on epidemiology, 10 on diagnosis, five on prognosis and 10 on management. ChatGPT’s answers were reviewed for medical accuracy and clinical appropriateness and were graded as incorrect, partially correct, correct or correct with a reference. Responses were graded by two otolaryngology resident physicians based on the most recent guidelines and statements from the American Thyroid Association, with a third author deciding the grade in cases of discrepancy. A Flesch-Kincaid grade level score was generated for each response, with a grade level of 7 corresponding to middle school, 10 to high school and 14 to college.
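The Flesch-Kincaid grade level is a standard readability formula based on average sentence length and syllables per word. The sketch below is a rough Python illustration of how such a score can be computed; it is not the scoring tool used in the study, and its syllable counter is a simplified heuristic.

```python
import re

def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level for a block of text.

    Grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    The syllable counter below is a rough vowel-group heuristic, so scores
    may differ slightly from dedicated readability tools.
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0

    def syllables(word: str) -> int:
        groups = re.findall(r"[aeiouy]+", word.lower())
        count = len(groups)
        if word.lower().endswith("e") and count > 1:
            count -= 1  # drop a typical silent final 'e'
        return max(1, count)

    total_syllables = sum(syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (total_syllables / len(words))
            - 15.59)

# Example: score a sample response before deciding whether it needs simplifying.
print(round(flesch_kincaid_grade(
    "Thyroid nodules are growths within the thyroid gland. "
    "Most are benign, but some require further evaluation with ultrasound."
), 1))
```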
Of the 120 responses generated by ChatGPT, 47.5% were correct, 21.7% were correct with a reference provided, 28.3% were partially correct and 2.5% were incorrect. Of the partially correct responses, 76.5% were deemed incomplete and the remainder were described as “too vague.”
ChatGPT provided 84 references across all 120 questions, of which 95.2% were from published medical literature and the remainder were from academic medical organization websites. Of the references, 12.5% could not be found or were incorrect, whereas the remainder were legitimate citations. ChatGPT accurately reported information from 82.9% of the legitimate citations; in the remaining 17.1%, it misstated or completely falsified findings from the reference.
“AI hallucination is a well-documented phenomenon in large language models — a phenomenon that was indeed evident in this current investigation,” Campbell said.
The mean grade level was 14.97 across all responses, 14.05 when patient-friendly prompting was used and 13.43 when prompts were written at an eighth grade reading level. When a prompt requesting references was used, the mean grade level increased to 16.43.
In a secondary analysis, researchers queried the Sept. 25, 2023, version of ChatGPT to assess its ability to issue responses at lower grade levels. Responses to eighth grade level prompting did not differ from those of the April version of ChatGPT. Response grade levels decreased when prompts were written at fourth grade and sixth grade reading levels compared with eighth grade level prompting. However, Campbell said the fourth and sixth grade level responses from ChatGPT were still at a reading level higher than the AMA’s recommendations for presenting patient information.
“Clinically, it would be very useful for future studies to delve into optimal prompting strategies for ChatGPT,” Campbell said. “As I find it inevitable that a subset of patients will use the chatbot for medical information, it would be helpful if we could provide patients with pre-question prompts to maximize its utility for patient education.”
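For illustration only, a pre-question prompt of the kind Campbell describes could be prepended as a system message before the patient's question. The sketch below assumes the OpenAI Python SDK (openai>=1.0) and a hypothetical model name; it is not a prompt or workflow evaluated in the study.

```python
# Hypothetical pre-question prompt requesting lower-reading-level answers,
# sketched with the OpenAI Python SDK (openai>=1.0). Illustrative only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

PATIENT_FRIENDLY_PREAMBLE = (
    "Please answer at a sixth grade reading level, using short sentences "
    "and plain language, and avoid medical jargon."
)

question = "What is a thyroid nodule and should I be worried about it?"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name for illustration
    messages=[
        {"role": "system", "content": PATIENT_FRIENDLY_PREAMBLE},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```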
Campbell also said future studies should compare ChatGPT with other chatbots, such as GPT-4, Google Bard and Bing Chat, to determine whether any are better suited to certain aspects of patient education.
For more information:
Daniel J. Campbell, MD, can be reached at daniel.campbell2@jefferson.edu.