AI tools require precise language, content formulation to better assist medical education
Key takeaways:
- Researchers examined ChatGPT performance on a neurosurgical test vs. human participants.
- The AI tool finished sixth out of 11 participants on the 60-minute exam.
Generative AI tools such as ChatGPT may be useful for medical education but require precise formulation of language and content to match and assist the skills of their human counterparts, according to new research published in Brain and Spine.
“Artificial intelligence tools create shockwaves in all fields of daily life, amongst blind enthusiasm, hyperbolic headlines and apocalyptic fear,” Andrea Bartoli, MD, a consultant neurosurgeon in the department of clinical neurosciences and division of neurosurgery at Geneva University Medical Center, and colleagues wrote. “Medical science has not been spared and ChatGPT has been tested in various medical fields.”
Bartoli and colleagues sought to assess the performance of ChatGPT, a generative language tool, at both generating questions for and answering a neurosurgical residents' written exam from the University of Geneva, and to determine whether residents' response rates differed on ChatGPT-generated questions.
Their study included 50 questions, both open-ended and multiple choice, from the standard written exam; 46 were generated by humans and four by ChatGPT. Eleven participants (ChatGPT and 10 residents) took the exam with a 60-minute time limit.
According to the results, ChatGPT scored 21.4, ranking sixth among the 11 test-takers, and answered all four of its self-generated questions correctly. However, on one question about pelvic parameters, all 10 residents interpreted the question correctly, whereas ChatGPT misinterpreted it and answered incorrectly.
The researchers observed no difference in response rate between human-generated and AI-generated questions that could be attributed to a lack of clarity.
“The next logical [experiment] in this field would be to compare results and rankings of neurosurgical residents when taking a fully AI-generated exam vs. a fully human-generated exam,” Bartoli and colleagues wrote.