April 03, 2024

Q&A: ChatGPT demonstrates greater clinical reasoning vs. physicians

Key takeaways:

  • ChatGPT-4 scored higher on the primary clinical reasoning measure vs. physicians.
  • AI will “almost certainly play a part” in cognitive work, including diagnosis, but further research is needed, an expert said.

Artificial intelligence performed better than internal medicine residents and attending physicians at clinical reasoning and processing medical data, a study in JAMA Internal Medicine showed.

Adam Rodman, MD, MPH, FACP, a general internist at Beth Israel Deaconess Medical Center, and colleagues compared ChatGPT-4’s reasoning abilities with those of 18 residents and 21 attendings at two academic medical centers.


Each physician was randomly assigned one of 20 clinical cases, each composed of four sequential stages of diagnostic reasoning. A prompt with identical instructions was developed and given to ChatGPT-4, which ran all 20 cases.

Clinical reasoning was measured through the Revised-IDEA (R-IDEA) score, “a validated 10-point scale evaluating four core domains of clinical reasoning documentation,” the researchers noted.

An R-IDEA score was classified as low if it was between 0 and 7 and high if it was between 8 and 10. Overall, median R-IDEA scores were:

  • 10 (interquartile range [IQR], 9-10) for ChatGPT-4;
  • 9 (IQR, 6-10) for attendings; and
  • 8 (IQR, 4-9) for residents.

The estimated probability of achieving high R-IDEA scores was 0.99 (95% CI, 0.98-1) for ChatGPT-4, 0.76 (95% CI, 0.51-1) for attendings and 0.56 (95% CI, 0.23-0.9) for residents.
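
To make the low/high classification above concrete, the following minimal Python sketch bins R-IDEA scores at the 8-point cutoff and reports each group's median and share of high scores. The scores and group labels in the snippet are illustrative placeholders, not the study's data, and the estimated probabilities reported above come from the study's own statistical analysis, which this sketch does not attempt to reproduce.

    # Illustrative sketch only: bin R-IDEA scores (0-10) at the 8-point cutoff
    # described above and report each group's median and share of "high" scores.
    from statistics import median

    # Hypothetical scores for demonstration; these are not the study's data.
    scores = {
        "ChatGPT-4": [10, 10, 9, 10, 8],
        "attendings": [9, 6, 10, 8, 7],
        "residents": [8, 4, 9, 5, 8],
    }

    HIGH_CUTOFF = 8  # R-IDEA scores of 8-10 are classified as high, 0-7 as low

    for group, values in scores.items():
        share_high = sum(v >= HIGH_CUTOFF for v in values) / len(values)
        print(f"{group}: median={median(values)}, share of high scores={share_high:.2f}")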

Rodman and colleagues added that attendings, residents and ChatGPT-4 all performed similarly in correct clinical reasoning, diagnostic accuracy and cannot-miss diagnosis inclusion.

Healio spoke with Rodman about what led to the study, what findings particularly stood out to him and more.

Healio: What led you to conduct this research?

Rodman: It quickly became clear after the public release of ChatGPT-4 that large language models (LLMs) had sometimes unnerving abilities to make clinical diagnoses. A number of early studies — including one that we ran, which was published in JAMA — confirmed this. [Editor’s note: Read more about that research here.] At the very least, when you used diagnostic criteria that have been used for decades, LLMs outperformed humans and performed at least as well as the top specialized diagnostic computer algorithms.

But from a clinical reasoning standpoint — that is, the process of how doctors actually make medical decisions — that's not especially satisfying. Despite what TV shows like House, M.D. would suggest, we don't really go around each day solving “zebras.” Most of the cognitive parts of our job focus on sorting through sometimes conflicting information, trying to pick out patterns, and building an appropriate differential, balancing the need to think about things that could be dangerous now, such as a heart attack, with things that are far more common, like GERD.

So, I had this question — are LLMs able to do tasks that better reflect the cognitive work of physicians?

Then, my second question had to do with the concept of a Turing test. What is the standard that we're going to compare LLMs to? Often in clinical reasoning, there's no "right" answer, even when there's a final diagnosis. What's the best way to get to a diagnosis? To weigh risks and benefits in ordering tests? To collect information from a patient?

We've had to grapple with these ideas in medical education for years as we try to teach and evaluate our medical students. But these ideas haven't really mattered for AI before, because we haven't had systems capable of performing in a humanlike manner.

A secondary goal was to come up with a set of evaluations — including a validated psychometric (that is, a validated measure of a psychological construct, which in this example is the ability to build a problem representation and a differential) — that would more accurately measure what humans do during reasoning.

Healio: What are your thoughts on the findings?

Rodman: The topline finding, of course, is that ChatGPT-4 was better than humans — it significantly outperformed both attendings and residents in our primary outcome (the presence of clinical reasoning).

As someone who teaches clinical reasoning, I wouldn't say it was perfect — although it almost always picked out pertinent data and had good differentials, it tended to be a bit unfocused, sometimes overly general. But it consistently outperformed humans.

The other interesting finding is something we've seen in other fields — the LLM hallucinated more (that is, produced incorrect reasoning more often). The examples were all minor — some of them so minor you would accuse us of serious nitpicking. But there's a tradeoff between creativity and hallucination. Of course, the humans "hallucinated," too (or whatever you call it when we're just plain old wrong).

Healio: What are the implications for primary care physicians?

Rodman: This study has a ton of limitations. Of note, although the information is reflective of real practice (it has red herrings and other extraneous information), it is highly curated and meant to teach humans. It is not unstructured data, as might be found in a chart, and neither the humans nor the machine collected the data themselves.

For PCPs — and really any doctors — this is another suggestion that LLMs are almost certainly going to play a part in our cognitive work (including diagnosis) at some point, but it doesn't address the most important questions.

Healio: Where does research on AI’s clinical reasoning go from here?

Rodman: There are three major directions the research needs to go. The first is human-computer interaction. How does using an AI actually change human reasoning? Does it make us better? Worse? No changes at all? Does how, when and with whom we interact with it matter? To do these studies, we need to study doctors.

The second is unstructured (or more realistic) clinical data. A lot of these studies are looking at human-curated data, which isn't exactly wrong — that's how we talk to each other. But if AI is going to operate at scale, it's going to have to use unstructured data from the chart. We need to see more of these studies — and with robust comparison groups.

Third and most exciting, we need to test the ability of AI tools to operate "in the wild." This means actual trials with doctors and patients. This is a challenge — I don't think the previous models by which we've done algorithmic clinical decision support fit, so we need to think about how we're going to maintain the safety of our patients while also doing robust studies.
