December 14, 2023

Artificial intelligence accurately estimates diagnosis probabilities


Key takeaways:

  • An artificial intelligence chatbot estimated the likelihood of a diagnosis based on patients’ presentation and test results.
  • The chatbot was more accurate than clinicians in cases where patients received a negative test result.

An artificial intelligence chatbot demonstrated more accurate probabilistic reasoning than clinicians when estimating the probability of a diagnosis after a patient received a negative test result, a recent study showed.

“Humans struggle with probabilistic reasoning, the practice of making decisions based on calculating odds,” Adam Rodman, MD, MPH, an internal medicine physician at Beth Israel Deaconess Medical Center, said in a press release. “Probabilistic reasoning is one of several components of making a diagnosis, which is an incredibly complex process that uses a variety of different cognitive strategies. We chose to evaluate probabilistic reasoning in isolation because it is a well-known area where humans could use support.”

Rodman A, et al. JAMA Netw Open. 2023;doi:10.1001/jamanetworkopen.2023.47075.

The researchers fed a large language model (LLM), ChatGPT-4, the same series of five cases used previously in a national practitioner survey (n = 553) and compared the model's pretest and post-test probability estimates, based on each patient's presentation, with the clinicians' responses.

These cases included:

  • pneumonia;
  • breast cancer;
  • asymptomatic bacteriuria;
  • coronary artery disease; and
  • urinary tract infection (UTI).

The chatbot then updated its estimates after it was given test results for each case.
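
As a rough illustration of the kind of pretest-to-post-test update being estimated, the sketch below applies Bayes' theorem to a single hypothetical case; the pretest probability, sensitivity and specificity are made-up values, not figures from the study.

```python
# Illustrative only: a Bayes update from a pretest to a post-test probability.
# The numbers are hypothetical and are not drawn from the Rodman et al. cases.

def posttest_probability(pretest_p, sensitivity, specificity, test_positive):
    """Update a pretest probability of disease given a test result."""
    if test_positive:
        # P(disease | positive) = sens*p / (sens*p + (1 - spec)*(1 - p))
        numerator = sensitivity * pretest_p
        denominator = numerator + (1 - specificity) * (1 - pretest_p)
    else:
        # P(disease | negative) = (1 - sens)*p / ((1 - sens)*p + spec*(1 - p))
        numerator = (1 - sensitivity) * pretest_p
        denominator = numerator + specificity * (1 - pretest_p)
    return numerator / denominator

# Hypothetical example: 20% pretest probability, a test with 90% sensitivity
# and 80% specificity, and a negative result -> roughly a 3% post-test probability.
print(round(posttest_probability(0.20, 0.90, 0.80, test_positive=False), 3))
```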

The researchers found that the LLM had less error than the clinicians in both pretest and post-test probability estimates across all five cases when test results were negative.

“For example, for the asymptomatic bacteriuria case, the median pretest probability was 26% for the LLM vs 20% for humans and the [mean absolute error] was 26.2 vs. 32.2 [respectively],” they wrote in JAMA Network Open.
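
For readers unfamiliar with the metric, mean absolute error is simply the average distance, in percentage points, between a set of probability estimates and the reference probability for a case. A minimal sketch of that comparison appears below; the reference value and the estimates are invented for illustration and are not the survey or LLM responses reported in the study.

```python
# Minimal sketch of a mean absolute error (MAE) comparison against a reference
# post-test probability. All numbers are invented; they are not study data.

def mean_absolute_error(estimates, reference):
    """Average absolute distance (percentage points) from the reference probability."""
    return sum(abs(e - reference) for e in estimates) / len(estimates)

reference_posttest = 5                       # hypothetical reference probability, %
clinician_estimates = [10, 30, 2, 50, 20]    # hypothetical survey responses, %
llm_estimates = [5, 8, 3, 12, 6]             # hypothetical repeated LLM runs, %

print(mean_absolute_error(clinician_estimates, reference_posttest))  # 18.6
print(mean_absolute_error(llm_estimates, reference_posttest))        # 2.6
```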

According to Rodman, “humans sometimes feel the risk is higher than it is after a negative test result, which can lead to overtreatment, more tests and too many medications.”

However, the LLM did not perform as well after positive test results. It had greater accuracy than clinicians in two cases, was similarly accurate in two cases and less accurate in one case.

“LLM estimates were worse than human estimates for the case that was framed as a [UTI] in the question stem but was actually asymptomatic bacteriuria,” Rodman and colleagues wrote. “Some human clinicians recognized this, but the model did not and likely gave estimates assuming the diagnosis of UTI was accurate.”

They noted that the study was limited by its use of a simple input-output prompt design strategy, and the cases were simplistic “in order to have clear reference standards.”
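
For context, a simple input-output prompt means the model is asked for an estimate in a single turn, without chain-of-thought instructions or follow-up refinement. The snippet below is one way such a single-turn query might look using the OpenAI Python client; the vignette wording and model name are assumptions for illustration, not the researchers' actual prompt.

```python
# Hypothetical single-turn ("input-output") prompt; the wording is illustrative
# and is not the prompt used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

case_vignette = (
    "A patient presents with dysuria and urinary frequency. "
    "Estimate the pretest probability (0-100%) of urinary tract infection. "
    "Reply with a single number."
)

response = client.chat.completions.create(
    model="gpt-4",  # stand-in model name for illustration
    messages=[{"role": "user", "content": case_vignette}],
)
print(response.choices[0].message.content)
```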

“LLMs can’t access the outside world — they aren’t calculating probabilities the way that epidemiologists, or even poker players, do,” Rodman explained. “What they're doing has a lot more in common with how humans make spot probabilistic decisions. But that’s what is exciting. Even if imperfect, their ease of use and ability to be integrated into clinical workflows could theoretically make humans make better decisions.”

Ultimately, “future research into collective human and artificial intelligence is sorely needed,” he concluded.

References:

Rodman A, et al. JAMA Netw Open. 2023;doi:10.1001/jamanetworkopen.2023.47075.