March 12, 2025

Patients rate ChatGPT health care descriptions as more credible than those from surgeons

Key takeaways:

  • ChatGPT scored higher than orthopedic surgeons on medical student-, resident- and fellow-level questions.
  • Multilevel modeling showed ChatGPT had the highest scores across all credibility domains: accuracy, authenticity and believability.

SAN DIEGO — Health care descriptions created by ChatGPT were perceived as more accurate, authentic and believable than descriptions from orthopedic surgeons by patients undergoing total hip arthroplasty, according to results presented here.

“Originally, we thought that AI was something that was far away and could not rival human expertise,” Perry L. Lim, BS, medical student researcher at Massachusetts General Hospital, told Healio about results presented at the American Academy of Orthopaedic Surgeons Annual Meeting. “But our study showed that not only is it at the level of human expertise, it sometimes even surpasses it. That is the big thing you have to worry about today.”

Data were derived from Blackburn AZ, et al. Paper 69. Presented at: American Academy of Orthopaedic Surgeons Annual Meeting; March 10-14, 2025; San Diego.

Lim and colleagues created four questions on hip arthroplasty topics, one at each of four difficulty levels: medical student, intern, resident and fellow/attending. Each question had four answers: one created by ChatGPT (OpenAI) and three from expert orthopedic surgeons.

Lim and colleagues randomly assigned 160 patients undergoing THA to one of the four questions. Patients were asked to compare the four answers, rate each on credibility, including accuracy, authenticity and believability, and indicate their most trusted answer.

Lim said ChatGPT scored higher than surgeons on the medical student-level, resident-level and fellow-level questions.

“ChatGPT continued to score higher than some of the surgeons and scored higher across all the surgeons for the believability domain,” Lim said in his presentation. “However, for the most complex question, which was an attending-level question, ChatGPT scored at the level of the treating surgeons in terms of credibility.”

He said ChatGPT had the highest credibility scores across accuracy, authenticity and believability domains in multilevel modeling.

“In the future, [we need to be] making sure that all information is double checked and that we confirm with our patients that, No. 1, AI is a tool,” Lim told Healio. “We want to make sure that [patients] are double checking with their orthopedic surgeons that any medical information they have seen is indeed correct.”

Perry L. Lim, BS, of Massachusetts General Hospital, can be reached at plim3@mgh.harvard.edu.