February 14, 2025

ChatGPT 3.5 and 4 exceeded the proficiency threshold on AAOS shoulder-elbow examination

Key takeaways:

  • ChatGPT 3.5 and ChatGPT 4 exceeded the proficiency threshold for questions on the AAOS shoulder-elbow examination.
  • Researchers noted ChatGPT is not accurate or reliable enough to replace clinical decision-making.

ChatGPT 3.5 and ChatGPT 4, by OpenAI, exceeded the proficiency threshold for written questions on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment examination, according to published results.

However, researchers noted ChatGPT is not accurate or reliable enough to replace clinical decision-making.

ChatGPT 3.5 and ChatGPT 4 exceeded the proficiency threshold for questions on the AAOS shoulder-elbow examination. Image: Adobe Stock

“ChatGPT 3.5 and ChatGPT 4 answered more than half of the text-based AAOS shoulder-elbow [self-assessment examination] SAE questions correctly, a percentage sufficient to obtain CME credit,” Benjamin Nieves-Lopez, BS, undergraduate researcher at the University of Puerto Rico Medical Sciences Campus, and colleagues wrote in the study. “ChatGPT 4 significantly outperformed ChatGPT 3.5, representing the dynamic learning potential of [large language models] LLMs.”

Nieves-Lopez and colleagues tested the abilities of ChatGPT 3.5 and ChatGPT 4 to answer 86 text-based questions from the 2019 and 2021 AAOS shoulder-elbow self-assessment examinations.

Overall, ChatGPT 3.5 answered 52.3% (n = 45) of the questions correctly, while ChatGPT 4 answered 73.3% (n = 63) of the questions correctly. Nieves-Lopez and colleagues noted both correct response rates exceeded the 50% threshold for obtaining CME credit.

ChatGPT 4 outperformed ChatGPT 3.5 in the anatomy (61.5% vs. 30.8%), arthroplasty (76% vs. 60%), basic science (50% vs. 25%), nonoperative (75% vs. 25%) and trauma (81.8% vs. 36.4%) question categories.

“Further refinement of ChatGPT’s training may improve its performance and utility as a resource,” Nieves-Lopez and colleagues wrote. “Currently, ChatGPT remains unable to answer questions at a high enough accuracy to replace clinical decision-making.”