Chatbot produces patient history summaries similar to those of internal medicine residents
Key takeaways:
- The study highlights the potential use of chatbots in assisting physicians with medical documentation.
- Future research should determine whether a newer version performs better.
History of present illness summaries created by a chatbot were graded similarly to those created by senior internal medicine residents, a recent study found.
Large language models (LLMs) “represent a substantial advancement in generative artificial intelligence (AI) with potential applications in many industries,” Ashwin Nayak, MD, MS, a clinical professor of medicine at Stanford University School of Medicine, and colleagues wrote in JAMA Internal Medicine.

“Medical documentation is a health care use case worth examining given its notable burden on clinicians,” they added.
To determine LLMs’ potential effectiveness in medical documentation, the researchers compared history of present illness (HPI) summaries that a chatbot generated from three scripted patient interviews with summaries written by senior internal medicine residents from the same scripts.
The chatbot generated HPIs through a process referred to as prompt engineering. After being given a basic prompt, it created 10 HPIs per script. The HPIs were analyzed and considered “acceptable” only if no errors were found, Nayak and colleagues noted. The prompt was then modified and the process was repeated twice more, for a total of three rounds of prompt engineering.
In the final round, one HPI per script was selected and compared with HPIs from the residents. Both resident and chatbot HPIs were blindly graded on organization, detail and succinctness by attending physicians using a 15-point composite scale.
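The workflow the researchers describe amounts to a generate-and-review loop: draft candidate HPIs, screen them for errors, refine the prompt, and repeat. The minimal Python sketch below illustrates that loop under stated assumptions; the script names, function names (generate_hpi, has_errors, refine_prompt) and the acceptance criterion are hypothetical placeholders for illustration, not the study’s actual code or API calls.

```python
# Hypothetical sketch of the iterative prompt-engineering workflow described
# in the study: three rounds of prompting, 10 HPIs per script per round,
# error review, and selection of one HPI per script for grading.
# Function names, scripts and the acceptance check are illustrative assumptions.

SCRIPTS = ["script_1", "script_2", "script_3"]  # placeholder patient-interview scripts
HPIS_PER_SCRIPT = 10
ROUNDS = 3


def generate_hpi(prompt: str, script: str) -> str:
    """Stub for an LLM call that turns a patient-interview script into an HPI.

    In practice this would call a chatbot API; it is stubbed here so the
    sketch stays self-contained and runnable.
    """
    return f"HPI drafted from {script} using prompt: {prompt[:40]}..."


def has_errors(hpi: str, script: str) -> bool:
    """Stub for the review step: an HPI counts as 'acceptable' only if it
    contains no errors (e.g., no details absent from the source script).
    In the study this review was performed manually by clinicians."""
    return False  # placeholder


def refine_prompt(prompt: str) -> str:
    """Stub for revising the prompt between rounds based on observed errors."""
    return prompt + " Use only details stated in the transcript."


def run_prompt_engineering(initial_prompt: str) -> dict:
    prompt = initial_prompt
    selected = {}

    for round_num in range(1, ROUNDS + 1):
        accepted, total = 0, 0
        candidates = {script: [] for script in SCRIPTS}

        for script in SCRIPTS:
            for _ in range(HPIS_PER_SCRIPT):
                hpi = generate_hpi(prompt, script)
                total += 1
                if not has_errors(hpi, script):
                    accepted += 1
                    candidates[script].append(hpi)

        print(f"Round {round_num}: acceptance rate {accepted / total:.1%}")

        if round_num < ROUNDS:
            prompt = refine_prompt(prompt)
        else:
            # Final round: pick one acceptable HPI per script to compare
            # against the residents' summaries.
            selected = {s: hpis[0] for s, hpis in candidates.items() if hpis}

    return selected


if __name__ == "__main__":
    run_prompt_engineering("Summarize this patient interview as an HPI.")
```

The structure mirrors what the article reports: the acceptance rate is tracked per round, and only the final round’s output feeds the blinded grading.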
Overall, the mean composite scores of HPIs were 12.18 for senior residents (n = 120) and 11.23 for the chatbot (n = 30).
Residents also scored higher than the chatbot in:
- detail orientation (4.13 vs. 3.57);
- succinctness (3.93 vs. 3.7); and
- organization (4.12 vs. 3.97).
Attending physicians had an accuracy of 61% (95% CI, 53%-68%) when classifying an HPI as resident or chatbot generated.
Nayak and colleagues reported that the most common error made by the chatbot was the addition of patient age and gender, neither of which was specified in the scripts.
The acceptance rate of HPIs created by the chatbot decreased from 10% in the first round of prompt engineering to 3.3% in the second round, but then jumped to 43.3% in the final round.
The researchers noted that the chatbot’s performance depended heavily on the quality of the prompt it was given. Ineffective prompts led the chatbot to report information that was not present in the source dialogue.
“This type of error, called a hallucination, has been noted in a prior assessment of generative AI models,” they wrote. “The generation of hallucinations in the medical record is clearly of great concern.”
As a result, collaboration between physicians and AI developers is needed to ensure prompts “are effectively engineered to optimize output accuracy” before LLMs can be safely used within clinical settings, Nayak and colleagues wrote.
The study had several limitations. Only the top chatbot responses were compared with residents’ outputs, and reproducibility was limited because the chatbot’s responses varied even when it was given identical prompts.
“These findings underscore the potential of chatbots to aid clinicians with medical documentation,” the researchers concluded. “Further work is needed to assess the utility of LLMs in other clinical summarization tasks.”