Chatbot recommendations often differ from cancer treatment guidelines
Approximately one-third (34.3%) of chatbot recommendations about cancer treatment do not appear in National Comprehensive Cancer Network guidelines, study results published in JAMA Oncology showed.
Researchers developed 104 unique prompts, entered via the ChatGPT interface, seeking treatment recommendations for 26 cancer diagnosis descriptions. Three board-certified oncologists used five scoring criteria to assess how well ChatGPT responses agreed with recommendations in the 2021 NCCN guidelines.
“We certainly didn’t expect ChatGPT to get this all right,” Danielle S. Bitterman, MD, assistant professor of radiation oncology at Harvard Medical School, told Healio. “It’s not trained to provide cancer treatment recommendations; it’s trained to be a chatbot.”
Healio spoke with Bitterman and Shan Chen, MS, also of Harvard Medical School, about their research and its potential real-world implications for oncology as AI-powered chatbot technology continues to improve in the coming years.
Healio: What motivated you to conduct this study?
Chen: We’ve been investigating large language models for some time and when ChatGPT specifically became available, we got excited about a lot of [potential] improvements for many general tasks using the technology.
One study from earlier this year really caught our attention showing how after ChatGPT launched, the volume of calls to a suicide hotline dropped to 30% of capacity. This was striking to me and shows that people want to chat with these things.
It’s not uncommon for people to Google something their doctor may tell them regarding their disease or maybe just to get a second opinion. So, we wanted to evaluate the reliability of chatbot responses.
Bitterman: There is so much excitement about these language models and they do have a lot of potential. But we were curious about the extent of clinical and medical knowledge that is embedded in these models and how easy it is to get reliable information out of them.
Healio: Can you briefly describe your findings? Did anything surprise you?
Chen: One-third of the recommendations from ChatGPT were off track compared with NCCN guidelines, which are an expert-validated set of guidelines on cancer treatment.
Additionally, hallucinations comprised 13% of the responses, which by our definition means a treatment type not mentioned in the guidelines or a treatment that doesn’t even exist. So, that 13% is a big problem.
The main surprise for me was that 98% of responses had at least one treatment that agrees with NCCN guidelines. The problem here is that ChatGPT tries to list a lot of potential treatments, many of which we later found were wrong. But we can say that at least one treatment recommendation on the list was usually valuable — it’s just hard for the patient to know which one that is.
Healio: What is the most important clinical implication of these findings?
Chen: The main point of our research was to consider potential patients using this technology to get more information. We found that ChatGPT was unreliable in this capacity.
It’s obviously hard to know how ChatGPT will change in the future. Patients should certainly do their own research if they wish, but please trust your clinician’s advice over these models.
Healio: What are the next steps in your research?
Chen: We are looking at approaches that are safer and do not directly interact with patients, such as sorting inbox messages: determining which messages are more emergent and potentially drafting a reply that the doctor would review before sending. Things like that can help alleviate the stress doctors face in replying to so many inbox messages in a week, which could be very helpful.
It could also be useful in writing clinical notes, not just by drafting them but also by logging or remembering certain things that might be overlooked by a physician, because we’re all human and we all make mistakes.
Bitterman: These tools clearly show potential to hold knowledge, and since they’re very big computational models, I would hope they have the ability in the future to help physicians and patients manage the increasingly complex medical landscape, where there’s more data than ever before being collected on patients.
There’s new information coming out all the time, and it’s very hard for a single person to manage that, so these tools can help us manage this increasing complexity, whether that’s doctors navigating electronic health records or maybe monitoring a patient’s symptoms. The technology isn’t there yet for these types of tasks, but I could see it being so in the future.
Healio: Is there anything additional you would like to emphasize?
Bitterman: This kind of early evaluation is important for setting the direction and taking the next steps toward creating an AI model that is reliable, factually correct and safe. This is an early step, but we must be careful in these early days because a mistake can end up hurting someone. That’s not only unacceptable because a patient’s health is the priority, but it can also set the field back — with us possibly missing out on the benefits of these advanced technologies.
For more information:
Shan Chen, MS, can be reached at schen73@bwh.harvard.edu.
Danielle S. Bitterman, MD, can be reached at danielle_bitterman@dfci.harvard.edu.