
July 05, 2023

Q&A: ChatGPT’s ability to pick the right breast cancer screening test is ‘impressive’


Key takeaways:

  • ChatGPT 4 was 98.4% accurate at identifying appropriate breast cancer screening recommendations.
  • The models’ accuracy could help reduce burnout and administrative duties for physicians.

Two ChatGPT models largely adhered to screening guidelines, identifying the correct imaging services for breast cancer screening and breast pain, according to researchers.

Marc D. Succi, MD, senior author and associate chair of innovation and commercialization at Mass General Brigham Radiology, along with Arya Rao, BA, an MD-PhD student at Harvard Medical School, and colleagues aimed to show the potential of artificial intelligence (AI) to enhance and support clinical decision-making.

Data derived from: Rao A, et al. J Am Coll Radiol. 2023;doi:10.1016/j.jacr.2023.05.003

“Integration of an AI-based tool into existing clinical workflows and systems could drastically improve efficiency, since such tools could take advantage of the wealth of information available from patient pretest odds, diagnostic likelihood ratios, and the medical records themselves,” they wrote in Journal of the American College of Radiology.

The researchers presented 21 prompts about breast cancer screening or breast pain to two of the most recent ChatGPT models — 3.5 and 4 — and then compared the responses with the American College of Radiology’s Appropriateness Criteria to determine how well they aligned with the guidance.
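The comparison is straightforward to reproduce in outline. Below is a minimal Python sketch, assuming the OpenAI chat API as it existed at the time of the study (the legacy openai package, pre-1.0); the prompt, reference answer and string-matching scorer are illustrative placeholders rather than the study's actual materials, and the study's own grading against the ACR criteria would have been more rigorous than a simple substring check.

```python
# Minimal sketch: send clinical prompts to two ChatGPT models and score
# each response against a guideline-concordant answer. Uses the legacy
# openai package (<1.0); prompts and answers are illustrative only.
import openai

openai.api_key = "YOUR_API_KEY"  # in practice, load from the environment

# Hypothetical (prompt, guideline-concordant test) pairs
CASES = [
    (
        "A 45-year-old woman at average risk asks about breast cancer "
        "screening. Which imaging study is most appropriate?",
        "mammography",
    ),
    # ... the study used 21 such scenarios
]

def ask(model: str, prompt: str) -> str:
    """Return the model's free-text imaging recommendation."""
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation while benchmarking
    )
    return resp["choices"][0]["message"]["content"]

for model in ("gpt-3.5-turbo", "gpt-4"):
    correct = sum(
        ref.lower() in ask(model, prompt).lower() for prompt, ref in CASES
    )
    print(f"{model}: {correct}/{len(CASES)} guideline-concordant")
```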

ChatGPT 4 achieved 98.4% accuracy for breast cancer screening recommendations and 77.7% accuracy for breast pain screening recommendations. Meanwhile, ChatGPT 3.5 achieved an accuracy of 88.9% and 58.3% for breast cancer and breast pain screening recommendations, respectively.

“In this scenario, ChatGPT's abilities were impressive,” Succi said in a press release. “I see it acting like a bridge between the referring health care professional and the expert radiologist — stepping in as a trained consultant to recommend the right imaging test at the point of care, without delay.”

Rao and Succi spoke with Healio about differences between the two ChatGPT models and the potential downsides of using AI technology for imaging services.

Healio: What’s the difference between ChatGPT 4 and other variations?

Succi: Most studies are on ChatGPT 3 or 3.5. The main difference is that 4 is trained on a lot more data: 45 gigabytes of training data compared with 17 for GPT 3. It’s multimodal, so it can take inputs of text and images. It also speaks some non-English languages better than GPT 3.5 spoke English, which is interesting. The reasoning ability and the creativity are also dialed up in GPT 4.

Rao: I would also add that the one major difference between GPT 4 and GPT 3.5 from a provider perspective is that you have to pay for GPT 4, but GPT 3.5 is publicly available. So, although GPT 4 represents a pretty significant advance in the technology, 3.5 is what most providers are going to have at hand if they want to try this out themselves.

Healio: Were you surprised by the findings, or were they expected?

Succi: I think we were surprised that it was so good, including GPT 3.5, even though GPT 3.5 was not nearly as good as GPT 4. These are generalist models that are trained on the internet and other sources of data. So, they're not subspecialty models. They're not domain specific to medicine.

The fact that a generalist model that hasn't even been fine-tuned with, let's say, patient data, which is not easily accessible, can do this well, get above 90% and pick the right imaging test for screening with GPT 4 — that's pretty great. I think it bodes well for the future of [large language models (LLMs)] that are going to be more specific to medicine and even subsets of medicine. We would expect the performance only to improve from here on out.

Healio: What are the implications for primary care physicians?

Succi: In radiology and medicine in general, there's a lot of burnout, and there are a lot of non-medicine-related tasks taking over the physician’s and health care provider’s time. Stuff like billing, protocoling studies... all these little in-between points of contact between a primary care doctor and the radiologist take time and are prone to human error.

The big benefit here is that it's clinical decision support for noninterpretive, nondiagnostic tasks, which frees up the primary care doctor and the radiologist to focus on patients.

The use case here would be that you're seeing a patient, a woman in her 40s, who is wondering about breast cancer screening, and at the point of care you can use GPT or a GPT plugin for Epic (the medical record software) that will suggest — based on the patient's age and demographics — that this patient is a candidate for this imaging, generally mammography for screening.

That's a great time saver, and it's a great decision support system for cases where the primary care doctor might not always remember to order these important tests.
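As a rough illustration of that point-of-care workflow, a hedged sketch follows. The Epic plugin itself is hypothetical and the prompt wording is invented; only the OpenAI chat call (legacy openai package, pre-1.0) reflects a real API.

```python
# Hypothetical point-of-care helper: given basic demographics, ask the
# model for a guideline-based breast imaging suggestion. The function
# name, prompt, and integration are illustrative, not a shipped product.
import openai

def suggest_screening(age: int, sex: str, symptoms: str = "none") -> str:
    prompt = (
        f"Patient: {age}-year-old {sex}; symptoms: {symptoms}. "
        "Per the ACR Appropriateness Criteria, which breast imaging "
        "study, if any, is most appropriate? Answer in one sentence."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

# e.g., suggest_screening(44, "female") would be expected to point
# toward screening mammography for an average-risk patient in her 40s.
```

In any real deployment, a suggestion like this would surface inside the ordering workflow, with the clinician making the final call.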

The other thing is that when you order a test, there's also a radiologist on the back end. If we can reduce that time sink by showing LLMs are reliable enough to replace that administrative function of the radiologist, or at least reduce the task, then we have savings on both ends — from the primary care [provider] and then also from the radiologist.

Healio: What are potential limitations or downsides to using this technology for imaging?

Succi: There are several downsides. It’s not ready for primetime. The two big ones are privacy and bias. There are ways to create firewalls and encapsulate LLMs within our health system’s medical records, where we could have access to the medical record without compromising patient data, but that's far from standardized. Then you think of primary care doctors who might be in the community. They're not going to have those resources. Copying and pasting a patient note into ChatGPT or another LLM is a big no-no under HIPAA privacy laws.

There’s also the fact that we don’t know how it’s trained, necessarily. Is it going to recommend tests more for one patient population from one area or one race or one language than others? Is it going to be more accurate or less accurate for patients of varying backgrounds? These are things that have to be benchmarked and tested. We're going to need to look at the underlying data. That's going to require collaboration and openness from companies like OpenAI and Google, with Med-PaLM and Bard.

Healio: Anything else to add?

Succi: I think there's a lot more to come as we start getting more LLMs, more competition in the marketplace, but also more fine-tuned models specific to medicine. We're using LLMs for a purpose they weren't necessarily designed for. We're in the very, very early days of LLMs. And I think in 5 years, we're going to look back at all this research as being important benchmarking.

Rao: I think a lot of people are concerned in this clinical decision support space about how accurate ChatGPT or LLMs are at making these decisions. And I don't know if that's really the question. I feel like it's more important at this point to think about the use cases that we're proposing and evaluating those. And to Dr. Succi’s point, the use cases are only going to grow as the technology improves.

Succi: The last point [I would make] is [that] we're not replacing radiologists or PCPs. The goal is to augment them and allow them to focus more on patient care as opposed to noninterpretive, nondiagnostic tasks. One thing we like to say is that AI won't replace doctors, but doctors who use AI will replace doctors who don't use it. So hopefully we get more doctors buying into stuff like GPT.

References:

Rao A, et al. J Am Coll Radiol. 2023;doi:10.1016/j.jacr.2023.05.003.