Fact checked by Shenaz Bagha


March 10, 2025
2 min read

Large language model may improve diagnosis of multiple sclerosis


Key takeaways:

  • The model predicted MS diagnosis with 91.5% specificity and non-MS diagnosis with 100% specificity.
  • However, it recorded relatively lower degrees of sensitivity and precision for less clear-cut cases.

WEST PALM BEACH, Fla. — Large language model algorithms may be reliable for synthesizing large patient databases to more rapidly and accurately diagnose multiple sclerosis, according to a poster presented at ACTRIMS.

“Timely diagnosis of MS is crucial as early initiation of disease-modifying therapy can reduce disease burden and health care costs,” Shruthi Venkatesh, BS, a doctoral student at the University of Pittsburgh School of Medicine, told Healio. “[Large language models] can synthesize large volumes of multimodal data derived from clinical encounters, radiology reports and laboratory testing ... to identify patterns consistent with a diagnosis of MS.”

The latest research from the University of Pittsburgh suggests large language models may be reliable for analyzing large patient datasets to determine MS diagnosis. Image: Adobe Stock

Venkatesh and colleagues aimed to streamline the diagnostic process by creating a large language model (LLM) pipeline that applies the McDonald criteria to MS diagnosis.

The researchers culled data from 105 individuals (mean age at first MS symptom, 10.73 years; 84% women; 90% white) from a clinic-based cohort study conducted between 2013 and 2017. All patients were aged 18 years and older and had a record of one or more visits to a neurology clinic. The researchers also included 10 healthy controls and 10 others with a related disorder (neuromyelitis optica or myelin oligodendrocyte glycoprotein antibody-associated disease).

Next, the researchers constructed a “decision tree” based on the McDonald criteria, which tracked the steps from initial presentation to a definitive diagnosis. They then built a generative pre-trained transformer model configured for reproducibility and set to a “low temperature,” balancing coherence against diversity in its answers to a given query. Further input into the model included three keywords: attacks (describing MS symptoms), lesions (found on MRI) and biomarkers (indicative of disease-related pathology).
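The “temperature” setting referenced here controls how sharply a language model concentrates probability on its most likely outputs. The short sketch below is illustrative only, not the study’s code: it shows how temperature scaling changes sampling behavior, with a low temperature making the top choice dominate (more coherent, reproducible output) and a high temperature flattening the distribution (more diverse output).

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Sample an index from raw logits after temperature scaling.

    Low temperature sharpens the softmax distribution toward the
    highest-scoring option; high temperature flattens it.
    """
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one sample from the resulting categorical distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

At a temperature of 0.1, repeated calls on the same logits almost always return the highest-scoring index; at a temperature of 10, the three options are chosen nearly uniformly.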

Then, Venkatesh and colleagues provided further direction for the dataset by classifying participants into six categories based on the status of their MS diagnosis, the status of a related disorder or their standing as healthy controls.

The model was subsequently evaluated by both human and automated means, with notes on the type of hallucination the algorithm experienced; human evaluation was guided by five subcategories (non-factual, incoherence, irrelevance, overreliance and reasoning error) to determine how the model performed when analyzing the data.

According to the results, the algorithm was able to confirm an MS diagnosis with 91.5% specificity and 68.2% sensitivity while also confirming disorders that were not MS with 100% sensitivity and specificity.

For cases requiring further confirmatory evidence, the model registered high specificity but relatively lower sensitivity and low precision.
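For readers less familiar with these measures, sensitivity, specificity and precision are all derived from a model’s confusion counts. The sketch below uses the standard definitions with illustrative numbers, not the study’s underlying data:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic accuracy measures from confusion counts.

    tp/fp/tn/fn = true positives, false positives, true negatives,
    false negatives for the diagnosis being evaluated.
    """
    sensitivity = tp / (tp + fn)  # true positive rate: fraction of real cases caught
    specificity = tn / (tn + fp)  # true negative rate: fraction of non-cases ruled out
    precision = tp / (tp + fp)    # positive predictive value: fraction of flags that are real
    return sensitivity, specificity, precision
```

A model can therefore show high specificity (few false alarms among non-cases) while sensitivity and precision lag, as reported here for the less clear-cut cases.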

Venkatesh and colleagues additionally noted that algorithm hallucinations led to incorrect diagnostic classifications, which they attributed to incomplete patient information, faulty understanding of the data or overreliance on clinical context.

“Our findings underscore the critical need for rigorous validation and hallucination mitigation in [large language model]-enhanced diagnostic support systems before clinical implementation,” Venkatesh said.