Fact checked by Kristen Dowd

September 13, 2024
3 min read

ChatGPT provides better responses to complex pediatric respiratory scenarios vs. trainees

Key takeaways:

  • ChatGPT achieved the highest median overall performance score, with 7.28 out of 9 points vs. 4.56 points for trainee doctors.
  • Large language model responses did not contain fabricated facts.

When presented with complex pediatric respiratory scenarios, responses from ChatGPT achieved higher performance scores than trainee doctor responses, according to a presentation at the European Respiratory Society International Congress.

“We are close to a world where large language models (LLMs) can assist clinicians (eg, in frontline busy clinical settings like ED, triage, general practice) in their diagnostic and management process, especially in unusual clinical situations,” Manjith Narayanan, PhD, FRCPCH, consultant in pediatric pulmonology at the Royal Hospital for Children and Young People in Edinburgh and honorary senior clinical lecturer at the University of Edinburgh, told Healio. “Before we deploy it in the frontline, we need to minimize any errors on part of the LLM and train clinicians on how to use it and the caveats on using it.”

In this study, Narayanan and colleagues assessed responses from ChatGPT 3.5, Microsoft Bing and Google Bard (all LLMs) on six complex pediatric respiratory scenarios vs. responses from trainee doctors with 1-hour internet access.

“When I started out with this study, I knew about previous findings where LLMs were tested against examination scenarios (and attained a ‘pass mark,’ etc),” Narayanan said. “I thought this was due to its ability to access huge amounts of data readily and ‘remember’ them better than a human. This is why I designed the current study to remove the element of (the fallible) human memory by giving the trainee doctors access to the internet (and 1-hour time) to answer the questions.”

The titles/topics of the six scenarios included adolescent breathlessness, teen sleepy in classroom, asthma prediction, cystic fibrosis patient losing weight, neurodisability pseudomonas and cough underlying disease.

Researchers brought in six experts to score the anonymized responses based on correctness, comprehensiveness, utility, plausibility and coherence (each scored 1 to 5 points). Each response was also given an overall score ranging from 1 to 9 points, with a higher score signaling better performance.

The highest median overall score was achieved by ChatGPT with 7.28 points, followed by Bard with 5.61 points, trainee doctors with 4.56 points and Bing with 4.17 points.

“I was surprised to see that GPT and, to some extent, Bard, outperformed trainee doctors even in this case [of having internet access],” Narayanan told Healio.

Additionally, for each assessed criterion (correctness, comprehensiveness, utility, plausibility and coherence), researchers observed the highest scores in responses from ChatGPT.

When compared with trainee doctors, ChatGPT had significantly higher scores in all five criteria and overall (P < .001).

In the comparison between trainee doctors and Bard, researchers found that Bard had a significantly higher coherence score (P < .05). Bing’s scores did not significantly differ from those of trainee doctors on any measure.

When the experts were asked about humanness, they identified responses from Bing and Bard as AI/non-human.

Lastly, researchers said LLM responses did not contain fabricated facts.

“As LLMs get more accepted (and used) in the clinical world, it is important to avoid (and mitigate) the biases that could potentially be built in the model due to training data (and reinforcement learning with human feedback) being predominantly based on Western literature, language and culture,” Narayanan told Healio. “We need to constantly test the output of the models to ensure that they are consistently helpful to the clinicians.”

Looking ahead, Narayanan said he is already involved in further research on LLMs.

“We are currently doing similar studies on more senior doctors (residents/registrars and below consultant grade) and on newer LLMs,” Narayanan told Healio. “Future studies will aim to look at how it performs in real life.

“It is important to state that we are not aiming to see whether large language models can replace doctors (that day is a long way in the future I hope!), but to see how well they can assist doctors,” Narayanan added.
