April 03, 2024

Specialized ChatGPT shows high accuracy in answering common patient questions

Key takeaways:

  • Experts rated ChatGPT version 3.5 responses as similar or better in correctness for 94% of questions, compared with professional society answers.
  • AI-generated responses reached a college reading level on average.

An artificial intelligence-powered large language model gave accurate and concise responses to questions commonly asked by patients undergoing radiation therapy, study results showed.

However, the model typically generated responses at a college reading level, according to findings published in JAMA Network Open.

[Infographic: AI-powered large language model performance]
Data derived from Yalamanchili A, et al. JAMA Netw Open. 2024;doi:10.1001/jamanetworkopen.2024.4630.

Researchers hope chatbots such as ChatGPT can serve as an educational tool for patients before or after meeting with a physician, but not as an outright replacement for medical visits.

“The findings underscore the potential of ChatGPT in facilitating patient-provider communication within radiation oncology,” P. Troy Teo, PhD, an instructor of radiation oncology at Northwestern University Feinberg School of Medicine, told Healio.

“However, it's evident that fine-tuning is essential to further improve accessibility and mitigate potential limitations,” Teo added. “This highlights the importance of ongoing refinement, coupled with continuous monitoring using a set of robust metrics and mechanisms, to ensure that AI-driven tools like ChatGPT effectively meet the diverse needs of patients and health care providers in real-world clinical settings.”

Background, methods

Artificial intelligence (AI)-powered large language models (LLMs) have shown potential to simulate human-like dialogue, which could benefit patient-clinician communication in radiation oncology before or after physician visits.

Researchers conducted a cross-sectional study to determine the quality of responses to radiation oncology patient care questions from an LLM using domain-specific expertise and domain-agnostic metrics.

The study used questions and answers from websites affiliated with the NCI and the Radiological Society of North America; the questions were submitted as queries to ChatGPT version 3.5 to generate LLM responses.

Researchers had three radiation oncologists and three radiation physicists rank the LLM-generated responses based on three factors — relative factual correctness, relative completeness and relative conciseness — all compared with online expert answers.

Experts ranked the responses on a 5-point Likert scale.

Results

Among the 115 radiation oncology questions derived from four professional society websites, experts deemed responses from the LLM as similar or better for 108 responses (94%) for relative correctness, 89 responses (77%) for completeness and 105 responses (91%) for conciseness, when compared with answers from experts.
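As a quick check on the arithmetic, the reported percentages can be reproduced from the counts above. This is only a minimal sketch: the variable and function names are illustrative, and it assumes each percentage was rounded to the nearest whole number.

```python
# Counts reported in the article: responses rated similar or better,
# out of 115 radiation oncology questions.
TOTAL_QUESTIONS = 115
similar_or_better = {
    "correctness": 108,
    "completeness": 89,
    "conciseness": 105,
}


def percent(count: int, total: int) -> int:
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


# Assumption: the article rounds each share to a whole percentage.
percentages = {
    criterion: percent(count, TOTAL_QUESTIONS)
    for criterion, count in similar_or_better.items()
}

for criterion, pct in percentages.items():
    print(f"{criterion}: {pct}%")
```

Under that rounding assumption, the counts work out to 94%, 77% and 91%, matching the figures reported.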

Experts marked two of the responses from the LLM as having potential harm.

The mean (± SD) readability consensus score was 10.63 ± 3.17 for expert answers, compared with 13.64 ± 2.22 for LLM answers, indicating approximately a tenth-grade reading level for responses from experts and a college reading level for responses from the LLM.
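The article does not describe how the readability consensus score is computed. As a rough illustration of how a grade-level score can be derived from text statistics, the sketch below uses the Flesch-Kincaid grade level formula as a stand-in, which may differ from the measure the researchers used. The function names, the example counts and the cutoff of 13 for "college" are assumptions made for illustration.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw text counts.

    A stand-in readability index for illustration; not necessarily
    the measure used in the study.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def describe_level(score: float) -> str:
    """Map a grade-level score to a label.

    Assumption: scores of 13 or higher are treated as college level,
    and lower scores are truncated to a school grade.
    """
    if score >= 13:
        return "college"
    return f"grade {int(score)}"


# Hypothetical passage: 100 words, 5 sentences, 160 syllables.
example_score = flesch_kincaid_grade(words=100, sentences=5, syllables=160)
print(f"example passage: {example_score:.2f} ({describe_level(example_score)})")

# Mean scores reported in the article.
print("expert answers:", describe_level(10.63))
print("LLM answers:", describe_level(13.64))
```

Under these assumptions, the reported means map to tenth grade for expert answers and college level for LLM answers, consistent with the interpretation given above.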

Next steps

Although most responses from AI-powered LLMs can appear accurate, comprehensive and concise — with minimal risk for harm — the researchers said the higher reading level presents a potential problem for patients seeking medical advice between visits.

Such tools can serve as a resource for patient questions, and retraining may make LLM responses more helpful in the future.

“We were mindful of the limitations inherent in evaluating LLM-generated responses solely based on a predefined set of questions, which may not fully capture the complexity and diversity of real-world patient queries,” Teo told Healio. “This realization underscored the importance of further refining our evaluation methodology to encompass a broader spectrum of patient interactions. Our future research will incorporate a more diverse range of patient queries, including those that are not captured by the standardized questions used in our study.

“By broadening the scope of our evaluation approach, we can gain a more comprehensive understanding of the capabilities and limitations of LLMs in addressing the diverse needs of patients and health care providers,” he added.