Chat-based AI models accurately answer AF questions from patients but not physicians
Key takeaways:
- ChatGPT and Bing AI gave mostly appropriate responses to patients’ queries about AF, treatments, medications and lifestyle.
- Upon expert review, replies to physician-level questions were less accurate.
ChatGPT and Bing AI provided mostly accurate responses to patients’ commonly asked questions about atrial fibrillation, but they appear unsuitable for aiding clinician-level decision-making, researchers reported.
“As we witness the rapid integration of AI into various aspects of our daily lives, it becomes crucial to comprehend its implications. The emergence of ChatGPT has demonstrated a tangible and noticeable outcome of AI advancement, reshaping the way we work, learn and stay informed,” Zahra Azizi, MD, MSc, postdoctoral scholar in cardiovascular medicine and clinical epidemiologist at Stanford University, told Healio. “While it is an immensely powerful tool, it is vital for both the medical community and the general public to remain mindful of its limitations. Therefore, our study's most significant finding is that despite the effectiveness of technological advancements in providing the public with reasonably reliable information, health care professionals should refrain from utilizing them for clinical decision-making. Patients should consult their health care providers for accurate clinical information and should not solely depend on this tool to enhance their knowledge. Consequently, our research offers a comprehensive overview of the limitations and risks associated with relying on such systems in a clinical setting.”
Azizi and colleagues cited a recent study published in JAMA in which researchers evaluated a research version of the ChatGPT model’s ability to provide appropriate recommendations on guideline-based preventive cardiology topics, including risk factor counseling and medication information.
As Healio previously reported, the ChatGPT model provided appropriate responses to a majority of CVD prevention questions, including complex topics such as cholesterol management despite statin therapy.
For the present analysis, Azizi and colleagues evaluated performance differences between patient and clinician questions about AF using two chat-based models: ChatGPT and Bing AI.
The study was published in Circulation: Arrhythmia and Electrophysiology.
The chat-based models were asked 18 patient-level questions, designed by clinicians experienced with AF, based on common queries from patients on overall AF, treatments, medications and lifestyle. Another 18 clinician-level questions were asked of the AIs to evaluate both text and reference accuracy.
AI responses were reviewed by three expert clinicians who categorized them as appropriate or inappropriate.
Azizi and colleagues observed a high level of agreement between the expert reviewers’ ratings of both AI models’ responses (kappa statistic, 0.77 to 1).
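For context, the kappa statistic quantifies agreement between raters beyond what chance alone would produce, with values near 1 indicating near-perfect agreement. A minimal sketch of how such agreement could be computed, using hypothetical appropriate/inappropriate ratings from two reviewers (not the study’s actual data) and the scikit-learn library:

```python
# Illustration only: Cohen's kappa on hypothetical reviewer ratings,
# not data from the Azizi et al. study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical appropriateness ratings (1 = appropriate, 0 = inappropriate)
# for 18 AI responses, as judged by two independent reviewers.
reviewer_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]
reviewer_b = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1 indicate strong agreement
```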
The researchers reported that patient-level responses were deemed appropriate for 83.3% of prompts, with 83.3% appropriateness for queries about overall AF, 100% for treatments, 100% for medications and 71.4% for lifestyle. However, the AIs provided inaccurate responses to questions pertaining to triggers of AF, including references to alcohol and coffee intake, according to the study.
“Based on our findings, we determined that the information provided to the general public was generally reliable. However, we did observe several instances of ‘hallucinations,’ where the system cited false evidence by merging various sources that do not actually exist,” Azizi told Healio. “As a result, we cannot confidently recommend relying solely on this technology for our patients. Furthermore, considering that each patient presents unique clinical indications, the answers provided by technologies like ChatGPT may not be applicable to all individuals. Therefore, engaging in discussions with health care professionals remains the most dependable source of information. While generative AIs can offer valuable overviews and insights, they should be used as complementary tools rather than exclusive sources.”
For clinician-level questions, text accuracy was reported for 33.3% of ChatGPT responses and 66.6% of Bing AI responses, whereas reference accuracy was 55.5% for ChatGPT and 50% for Bing AI.
For AI responses deemed appropriate by clinician reviewers, most referenced current U.S. and European guidelines and some cited primary literature, according to the study.
However, Azizi and colleagues reported that in response to two clinician-level queries related to AF management, ChatGPT provided references and trials that did not exist.
“We know that this technology does not possess clinical judgment and is limited to aggregating and assembling data sourced from the internet,” Azizi said. “As health care professionals, we consider not only the most up-to-date practice guidelines but also new groundbreaking trials and the clinical context, which ChatGPT does not incorporate. We recognize the growing popularity of these tools and sought to investigate and highlight the potential risks associated with relying on them for clinical decision-making.
“While these tools are valuable for gathering information, they lack the expertise and clinical training necessary for making sound judgments,” she said. “Therefore, it is crucial to refrain from utilizing them in clinical settings.”
For more information:
Zahra Azizi, MD, MSc, can be reached at zazizi@stanford.edu.