October 05, 2023

AI tools result in more false-positive findings on chest X-rays vs. radiologists

Fact checked by Kristen Dowd

Key takeaways:

  • AI tools had moderate to high sensitivity in detecting three lung conditions but a higher rate of false positives than radiologists.
  • AI sensitivity decreased when the diseases appeared smaller on X-rays.

Radiologists had a lower rate of false positives when detecting airspace disease, pneumothorax and pleural effusion on chest X-rays compared with artificial intelligence tools, according to results published in Radiology.

“Our study found that radiologists generally outperformed the AI tools,” Louis Lind Plesner, MD, resident radiologist and PhD fellow in the department of radiology at Herlev and Gentofte Hospital in Copenhagen, Denmark, told Healio. “This contrasts with some previous studies and the general public belief in AI systems. For us, it was not surprising because we believe that this finding is due to the fact that the radiologists were working in the clinic and not reading in a research setting. This means that the radiologists had access to patient information and previous radiographs, CTs and so on, giving them a large advantage over the AIs, which only ‘look’ at the image pixels.

Infographic showing rates of false-positive findings on chest X-rays.
Data were derived from Plesner LL, et al. Radiology. 2023;doi:10.1148/radiol.231236.

“It seems the AIs are excellent at ‘looking’ at the image pixels, but next-generation AI should also focus on incorporating patient information and prior radiographs to make the interpretation of image findings better by the AI. Then — in my opinion — it would be very likely that the AIs could match the radiologists, though this is pure speculation.”

In a retrospective study, Plesner and colleagues assessed 2,040 adults (median age, 72 years; 1,033 women) with a chest radiograph to compare the accuracy of four commercially available AI tools with that of two to three radiologists (n = 72) in detecting airspace disease, pneumothorax and pleural effusion.

The following AI tools were featured in this study: Annalise Enterprise CXR (version 2.2; Annalise.ai), SmartUrgences (version 1.24; Milvue, Incepto), ChestEye (version 2.6; Oxipit) and AI-Rad Companion (version 10; Siemens Healthineers).

From the total cohort, 669 patients (32.8%) had at least one of the outlined conditions, including 393 X-rays with airspace disease, 365 with pleural effusion and 78 with pneumothorax.

AI performance

Among the X-rays assessed by the AI tools, the highest area under the receiver operating characteristic curve was observed for pleural effusion (range, 0.94-0.97), followed by pneumothorax (range, 0.89-0.97) and airspace disease (range, 0.83-0.88).
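
As context for these figures, AUC is calculated from a tool's continuous abnormality scores measured against reference-standard labels. Below is a minimal, illustrative sketch in Python using scikit-learn's roc_auc_score; the scores and labels are invented for demonstration and are not study data.

    # Illustrative only: AUC for a hypothetical AI tool's pleural
    # effusion scores. All numbers are made up; none come from the
    # Plesner et al. study.
    from sklearn.metrics import roc_auc_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # reference standard (1 = effusion present)
    y_score = [0.92, 0.10, 0.85, 0.35, 0.30, 0.05, 0.75, 0.40]  # AI confidence scores

    print(f"AUC: {roc_auc_score(y_true, y_score):.2f}")  # ~0.94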

Researchers also observed that the AI tools had moderate to high sensitivity, with rates ranging from 62% to 95% for pleural effusion, 63% to 90% for pneumothorax and 72% to 91% for airspace disease. Notably, radiologists had similar sensitivity rates for each disease.

When the three conditions appeared smaller on X-rays, AI sensitivity decreased (airspace disease, 33%-61%; pneumothorax/pleural effusion, 9%-94%), according to researchers.

In terms of specificity, when given chest X-rays with normal or single findings, the AI tools had high specificity in detecting pleural effusion (range, 95%-100%), pneumothorax (range, 99%-100%) and airspace disease (range, 85%-96%). However, four or more findings on an X-ray resulted in decreased specificity for each of the conditions (pleural effusion, 65%-92%; pneumothorax, 96%-99%; airspace disease, 27%-69%).

“Clinicians should know that in general the AI will not miss major findings, but can tend to overdiagnose, especially in complex patients with multiple lung X-ray findings,” Plesner told Healio. “They should know that the AI will likely not be as good as a radiologist for actually interpreting the findings but can be very good at detecting something abnormal.”

For each of the conditions, researchers reported that the AI tools had high negative predictive values (range, 92%-100%) and reduced positive predictive values (airspace disease range, 37%-55%; pneumothorax range, 60%-86%; pleural effusion range, 56%-84%).
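
The gap between these two measures is largely arithmetic: predictive values follow directly from sensitivity, specificity and disease prevalence. The sketch below is a minimal illustration assuming figures drawn from the airspace disease ranges above (prevalence of 393/2,040, or about 19%); it shows why NPV stays high while PPV falls.

    # Illustrative only: how PPV and NPV follow from sensitivity,
    # specificity and prevalence. The inputs are assumptions chosen
    # from the reported airspace disease ranges, not exact study results.
    def predictive_values(sens, spec, prev):
        tp = sens * prev              # true-positive fraction of all patients
        fp = (1 - spec) * (1 - prev)  # false-positive fraction
        tn = spec * (1 - prev)        # true-negative fraction
        fn = (1 - sens) * prev        # false-negative fraction
        return tp / (tp + fp), tn / (tn + fn)

    ppv, npv = predictive_values(sens=0.80, spec=0.85, prev=393 / 2040)
    print(f"PPV: {ppv:.0%}, NPV: {npv:.0%}")  # PPV ~56%, NPV ~95%

At a prevalence under 20%, even a modest dip in specificity generates enough false positives to pull PPV down while NPV remains high, consistent with the pattern the researchers reported.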

AI tools vs. radiologists

Compared with the radiologists’ reports on chest radiographs, researchers found more false positives with most of the AI tools for airspace disease (13.7%-36.9% vs. 11.6%; P values ranging from < .001 to .01), pneumothorax (1.1%-2.4% vs. 0.2%; P < .001 for all except one AI tool, which had a lower false-positive rate of 0.4%) and pleural effusion (7.7%-16.4% vs. 4.2%; P < .001 for all).
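
Paired comparisons of false-positive rates like these are commonly tested with McNemar's test on the discordant readings. The study's statistical code is not shown here, so the sketch below, using statsmodels and invented counts, is only an assumption about how such a comparison could be run.

    # Illustrative only: McNemar's test comparing an AI tool's false
    # positives with radiologist reports on the same disease-negative
    # X-rays. The 2x2 counts are invented, not study data.
    from statsmodels.stats.contingency_tables import mcnemar

    # Rows: radiologist (correct / false positive);
    # columns: AI tool (correct / false positive).
    table = [[1200, 180],
             [60, 45]]

    result = mcnemar(table, exact=False, correction=True)
    print(f"statistic={result.statistic:.1f}, p={result.pvalue:.4f}")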

Notably, false-negative rates did not differ between the AI tools and radiology reports, according to researchers.

“We were surprised that the AI tools had such a high variability between them in false-negative rate and false-positive rate,” Plesner told Healio. “Any diagnostic test will have to balance these two parameters, and no tool had both the ‘best’/lowest false-negative rate and false-positive rate. This tells us that you should run a test before any implementation of these tools to figure out the balance between the two parameters in your own population, to figure out which impact it might have. In other words, it will not be possible to guarantee a certain false-negative rate and false-positive rate before implementation.”

Future research should consider the patient and the impact AI has on them, Plesner said.

“For example, [research should address] if more important pathology is found when using AI, which could benefit the patients, or if AI assistance will produce more unnecessary follow-up on patients due to false positives, which could harm the patients,” Plesner said.

Future with AI

This study by Plesner and colleagues demonstrates the value of AI for different lung disease diagnoses, but more advancements in AI are needed for fewer false-positive findings, according to an accompanying editorial by Masahiro Yanagawa, MD, PhD, associate professor of radiology at Osaka University Graduate School of Medicine, and Noriyuki Tomiyama, MD, PhD, professor and chairman of the department of radiology at Osaka University Graduate School of Medicine.

Notably, the lack of results categorized by the physician’s level of experience in this study was a missed opportunity, Yanagawa and Tomiyama wrote.

“A main limitation of this study was that the authors did not evaluate the results according to physicians of different experience levels, with and without AI tools, compared with the reference standard,” Yanagawa and Tomiyama wrote. “Although this analysis is not feasible when testing multiple AI tools, it will be essential to investigate the impact of using AI on physician performance. It would be more interesting to evaluate the performance of AI according to the actual clinical environment.”

References:

Plesner LL, et al. Radiology. 2023;doi:10.1148/radiol.231236.