Fact checked by Heather Biele

April 01, 2024
2 min read

ChatGPT-4 accurately interprets ophthalmic images

Key takeaways:

  • ChatGPT-4 correctly answered most multiple-choice questions pertaining to image recognition in ophthalmic cases.
  • The chatbot performed better on nonimage-based vs. image-based questions.

The latest version of an artificial intelligence chatbot accurately responded to 70% of multiple-choice questions pertaining to ophthalmic cases based on imaging interpretation, according to a study published in JAMA Ophthalmology.

“As the use of multimodal chatbots becomes increasingly widespread, it is imperative to stress their appropriate integration within medical contexts,” Andrew Mihalache, MD candidate at the Temerty Faculty of Medicine at the University of Toronto in Ontario, Canada, and colleagues wrote.

"It is impressive that the artificial intelligence chatbot was able to correctly answer approximately two-thirds of multiple-choice questions pertaining to multimodal ophthalmic images." Rajeev H. Muni, MD, MSc, FRCSC

In a cross-sectional study including 136 ophthalmic cases and 448 images provided by the medical education platform OCTCases, researchers evaluated the performance of ChatGPT-4 (OpenAI), an AI chatbot capable of processing ophthalmic imaging data. They used multiple-choice questions in the statistical analysis rather than open-ended questions to allow for objective grading of the chatbot’s responses.

The primary endpoint was the chatbot’s accuracy in answering multiple-choice questions pertaining to image recognition in ophthalmic cases — organized into categories including retina, neuro-ophthalmology, uveitis, glaucoma, ocular oncology and pediatric ophthalmology — measured as the proportion of correct responses, according to the researchers.

Secondary endpoints included the differences in the chatbot’s performance on image- vs. nonimage-based questions, as well as the association between the number of images inputted and the proportion of multiple-choice questions answered correctly per case.
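To make that secondary analysis concrete, the sketch below runs a rank correlation between images per case and per-case accuracy. The article does not state which correlation measure the authors used, so Spearman’s rho is an assumption here, and every data point is invented for illustration.

```python
# Hypothetical sketch of the secondary association analysis: does a case's
# accuracy vary with the number of images inputted? The choice of
# Spearman's rho and all data values below are assumptions, not study data.
from scipy.stats import spearmanr

images_per_case = [1, 2, 2, 3, 4, 5, 6]                         # hypothetical
accuracy_per_case = [0.80, 0.75, 0.70, 0.60, 0.70, 0.65, 0.50]  # hypothetical

rho, p_value = spearmanr(images_per_case, accuracy_per_case)
print(f"Spearman rho = {rho:.2f}, P = {p_value:.3f}")
```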

Researchers conducted χ² tests to compare proportions of correct responses across different ophthalmic subspecialties.
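As a rough illustration of this kind of test, the snippet below runs a Pearson chi-square on a 2 × 2 table of correct vs. incorrect responses for two subspecialties using scipy. The counts are hypothetical, since the article reports only percentages, not per-category question totals.

```python
# Minimal sketch of a chi-square comparison of correct-response proportions
# between two subspecialties; the counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: retina, neuro-ophthalmology; columns: correct, incorrect.
table = np.array([
    [100, 30],  # retina (hypothetical counts; ~77% correct)
    [58, 42],   # neuro-ophthalmology (hypothetical counts; 58% correct)
])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square ({dof} df) = {chi2:.1f}, P = {p:.4g}")
```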

Of the 429 multiple-choice questions included in the analysis, ChatGPT-4 answered 299 (70%) correctly across all cases.

“Given the complexity of ophthalmic image interpretation, it is impressive that the artificial intelligence chatbot was able to correctly answer approximately two-thirds of multiple-choice questions pertaining to multimodal ophthalmic images,” Rajeev H. Muni, MD, MSc, FRCSC, co-author and vice chair of clinical research in the university’s department of ophthalmology and vision sciences, told Healio. “Over time, improvements in AI models could improve the chatbot’s accuracy considerably.”

The chatbot performed better on retina questions than neuro-ophthalmology questions (77% vs. 58%; difference, 18%; χ²₁ = 11.4; P < .001).

The chatbot’s performance also appeared better on nonimage-based questions compared with image-based questions (82% vs. 65%; difference, 17%; χ²₁ = 12.2; P < .001).
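This figure can be loosely sanity-checked from the numbers in the article. The sketch below back-solves an approximate split between image- and nonimage-based questions from the reported 429 total questions and 299 correct answers (the actual split was not reported, so the counts are estimates), then recomputes the chi-square statistic.

```python
# Back-of-envelope check of the image- vs. nonimage-based comparison.
# The 310/119 split is solved approximately from the reported overall
# accuracy (299/429) and the group percentages (65% and 82%); it was not
# reported in the article, so treat everything below as an estimate.
import numpy as np
from scipy.stats import chi2_contingency

n_image, n_nonimage = 310, 119               # assumed split, sums to 429
correct_image = round(0.65 * n_image)        # ~202 correct
correct_nonimage = round(0.82 * n_nonimage)  # ~98 correct

table = np.array([
    [correct_image, n_image - correct_image],
    [correct_nonimage, n_nonimage - correct_nonimage],
])

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi-square ({dof} df) = {chi2:.1f}, P = {p:.4g}")
# Prints roughly chi-square (1 df) = 12.1, close to the reported 12.2.
```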

Additionally, the chatbot demonstrated intermediate performance on questions from the ocular oncology (72% correct), pediatric ophthalmology (68% correct), uveitis (67% correct) and glaucoma (61% correct) categories.

“Our findings show that the artificial intelligence chatbot has the potential to serve as a valuable educational resource for clinicians and trainees one day, as it is capable of identifying and interpreting abnormalities present on ophthalmic imaging modalities with moderate accuracy,” Muni said. “As its knowledge and sophistication advances, the artificial intelligence chatbot may eventually play a role in clinical decision-making.”

The researchers noted that their present investigation, which assessed the AI chatbot’s performance on multiple-choice questions pertaining to multimodal ophthalmic cases, may not translate to its real-world clinical utility.

“Our team’s future work aims to assess the chatbot’s diagnostic accuracy on ophthalmic imaging cases without the use of multiple-choice prompts,” Muni added. “We also aim to compare the performance of the AI chatbot relative to board-certified ophthalmologists.”