Fact checked by Heather Biele

February 22, 2024
AI chatbot performance ‘eye-opening’ in accuracy, completeness vs. ophthalmologists


Key takeaways:

  • The LLM chatbot achieved a mean rank accuracy score of 506.2, compared with 403.4 for glaucoma specialists.
  • The chatbot achieved a mean rank accuracy score of 235.3, compared with 216.1 for retina specialists.

A large language model chatbot matched and even outperformed fellowship-trained ophthalmologists in diagnostic and treatment accuracy of retina and glaucoma cases, according to new research published in JAMA Ophthalmology.

“The findings are crucial as they highlight the potential of AI as a support tool in medical diagnostics,” Andy S. Huang, MD, a resident physician at the New York Eye and Ear Infirmary of Mount Sinai, told Healio. “If AI can effectively assist or even match specialists, it can revolutionize or drastically shift the current health care delivery, offering support in decision-making.”


In a comparative cross-sectional study, Huang and colleagues recruited 15 participants aged 31 to 67 years, including 12 attending physicians and three senior trainees, to compare the diagnostic and treatment accuracy of artificial intelligence-generated responses against those of fellowship-trained ophthalmologists.

Participants and GPT-4 (OpenAI), a large language model (LLM), answered clinical questions and provided case-management assessments for 20 deidentified glaucoma and retinal cases seen at clinics affiliated with Mount Sinai. Researchers used a Likert scale to assess those answers for medical accuracy and completeness.

According to results, the LLM chatbot scored a combined question-case mean rank accuracy of 506.2 and a mean rank for completeness of 528.3, while glaucoma specialists scored 403.4 and 398.7, respectively (P < .001). Compared with retina specialists, who scored 216.1 for accuracy and 208.7 for completeness, the chatbot scored 235.3 and 258.3.

Using Dunn post hoc pairwise comparison tests, the researchers found that trainees and specialists rated the chatbot’s responses more favorably in both accuracy and completeness than those of their fellow ophthalmologists.

“The performance of GPT-4 in this study was quite eye-opening,” Huang told Healio. “It was fascinating to see that ChatGPT not only can assist, but in some cases matched or exceeded the expertise of seasoned ophthalmology specialists.”

He continued, “While we want to proceed with extreme caution and will need additional careful testing, the next step would be to integrate this technology responsibly and ethically into enhancing patient care.”