February 14, 2025

ChatGPT 3.5 and 4 exceeded the proficiency threshold on AAOS shoulder-elbow examination

Key takeaways:

  • ChatGPT 3.5 and ChatGPT 4 exceeded the proficiency threshold for questions on the AAOS shoulder-elbow examination.
  • Researchers noted ChatGPT is not accurate or reliable enough to replace clinical decision-making.

ChatGPT 3.5 and ChatGPT 4, by OpenAI, exceeded the proficiency threshold for written questions on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment examination, according to published results.

However, researchers noted ChatGPT is not accurate or reliable enough to replace clinical decision-making.

ChatGPT 3.5 and ChatGPT 4 exceeded the proficiency threshold for questions on the AAOS shoulder-elbow examination. Image: Adobe Stock

“ChatGPT 3.5 and ChatGPT 4 answered more than half of the text-based AAOS shoulder-elbow [self-assessment examination] SAE questions correctly, a percentage sufficient to obtain CME credit,” Benjamin Nieves-Lopez, BS, undergraduate researcher at the University of Puerto Rico Medical Sciences Campus, and colleagues wrote in the study. “ChatGPT 4 significantly outperformed ChatGPT 3.5, representing the dynamic learning potential of [large language models] LLMs.”

Nieves-Lopez and colleagues tested the abilities of ChatGPT 3.5 and ChatGPT 4 to answer 86 text-based questions from the 2019 and 2021 AAOS shoulder-elbow self-assessment examinations.

Overall, ChatGPT 3.5 answered 52.3% (n = 45) of the questions correctly, while ChatGPT 4 answered 73.3% (n = 63) of the questions correctly. Nieves-Lopez and colleagues noted both correct response rates exceeded the 50% threshold for obtaining CME credit.

ChatGPT 4 outperformed ChatGPT 3.5 in the anatomy (61.5% vs. 30.8%), arthroplasty (76% vs. 60%), basic science (50% vs. 25%), nonoperative (75% vs. 25%) and trauma (81.8% vs. 36.4%) question categories.

“Further refinement of ChatGPT’s training may improve its performance and utility as a resource,” Nieves-Lopez and colleagues wrote. “Currently, ChatGPT remains unable to answer questions at a high enough accuracy to replace clinical decision-making.”