Issue: December 2014
December 01, 2014

Web traffic shows promise as predictor for disease outbreaks

Researchers are using online traffic data to improve the speed and accuracy of outbreak estimates around the world.

Perspective from William Schaffner, MD

Traditionally, disease observation records are collected from clinical diagnoses, emergency department (ED) admissions, over-the-counter medication sales, and school or work absenteeism. While outbreak surveillance is vital to lessening the impact of many infectious diseases, it often takes public health groups such as the CDC about 1 to 2 weeks to compile these data for an early estimate, with more precise counts released later as additional sources report their numbers.

This practice is accurate, but costly and time consuming. Conversely, predictive models use previously collected outbreak data to project upcoming results, but often at the cost of reliable accuracy beyond a few weeks.

According to recent data, a solution to this tradeoff could be found online where millions of users log disease-related queries on a regular basis. By using real-time data freely available from popular websites and online services, researchers have outlined how measuring web traffic may translate into faster outbreak results.

Google search analytics

“It’s true that simply using the number of searches as an estimate of flu levels can result in misleading figures,” Tobias Preis, PhD, of the business school at the University of Warwick, United Kingdom, said in a press release. “However, simple models can be built to watch out for increases in searches that do not correspond to increases in reports of flu, and which use this information to improve upcoming estimates.”

In a study published in Royal Society Open Science, Preis and Susannah Moat, PhD, created a disease-reporting model based on historic influenza data collected from the CDC’s US Outpatient Influenza-like Illness Surveillance Network (ILINet) and weekly Google Flu Trends queries for searches relating to influenza symptoms. They compared their integrated model’s predictive accuracy with one based solely on CDC database information.

The mean absolute error of the researchers’ integrated model was lower than that of the CDC-only model both when estimating previously known values (0.114 vs. 0.131) and when predicting illness (0.133 vs. 0.162). By using search data, the integrated model reduced mean absolute error for predictions by 16% to 52.7%, depending on how many weeks of data were used to train the model.
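To make the error comparison concrete, the short Python sketch below computes mean absolute error and the relative reduction implied by the reported averages. The function names are illustrative and are not taken from the study. Because the inputs are the overall averages quoted above, the result is a single summary figure rather than the 16% to 52.7% range, which varied with the length of the training window.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Averages quoted in the article: CDC-only model vs. integrated model.
estimation_cut = relative_reduction(0.131, 0.114)
prediction_cut = relative_reduction(0.162, 0.133)

print(f"estimation error reduced by {estimation_cut:.1%}")
print(f"prediction error reduced by {prediction_cut:.1%}")  # about 17.9%
```

A reduction near 18% for predictions sits inside the reported 16% to 52.7% range, which is what one would expect from an average across training windows.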

Because the algorithm used by Google Flu Trends was revised near the end of the 2010-2013 study period, a large portion of the data collected represents the older search algorithm.

“Our results show that public health professionals can indeed use data on the number of Google searches for flu-related symptoms to improve their estimates of how many people have the flu right now, as long as their analysis takes simple precautions to allow for the fact that human behavior can change across time,” Preis said in the release.

Wikipedia page views

According to Sara Y. Del Valle, PhD, of the Los Alamos National Laboratory in New Mexico, global disease forecasting has the potential to change the way public health officials respond to epidemics.

“In the same way we check the weather each morning, individuals and public health officials can monitor disease incidence and plan for the future based on today’s forecast,” Del Valle said in a press release.

Del Valle and colleagues analyzed Wikipedia page view records from March 7, 2010 to Feb. 1 for various infectious disease page views, as well as proxy data to determine user locations worldwide. Much like Preis and Moat, they used disease incidence data collected from WHO epidemiological reports to create their own models.

Of the 14 disease-location contexts analyzed, eight were successful and six were not. Cases that researchers considered successful had r² values ranging from 0.66 to 0.92, and could forecast values up to 28 days in advance. Researchers suspect that people use Wikipedia to gather information about diseases before seeking medical attention.
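As a rough illustration of what an r² value at a given forecast horizon measures, the Python sketch below pairs each page view count with the incidence value recorded a fixed number of time steps later and squares their Pearson correlation. This is a simplified stand-in and not the researchers' model; the function names and all numbers are invented for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / sqrt(var_x * var_y)


def lagged_r_squared(page_views, incidence, lag):
    """r^2 between page views at step t and incidence at step t + lag."""
    if len(page_views) != len(incidence):
        raise ValueError("series must be the same length")
    if lag < 0 or lag > len(page_views) - 2:
        raise ValueError("lag leaves too few paired points")
    views = page_views[: len(page_views) - lag]
    later_cases = incidence[lag:]
    return pearson_r(views, later_cases) ** 2


# Synthetic series: incidence trails page views by two steps.
page_views = [120, 150, 210, 330, 400, 380, 290, 200, 160, 130]
incidence = [10.0, 11.0, 12.0, 15.0, 21.0, 33.0, 40.0, 38.0, 29.0, 20.0]

for lag in (0, 1, 2):
    print(lag, round(lagged_r_squared(page_views, incidence, lag), 3))
```

In this toy series the fit is strongest at a lag of two steps, which is the pattern that would suggest page views carry advance information about case counts.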

Three of the failed cases were due to patterns in official data that were too subtle for the model to recognize; the other three were inaccurate because the signal-to-noise ratio in the Wikipedia data was too low.

Despite these failures, the researchers detailed further areas of study and revision that they said could improve this method of outbreak monitoring in future studies.

“The goal is to build an operational disease-monitoring and forecasting system with open data and open source code,” Del Valle said. “This paper shows we can achieve that goal.”

Twitter conversations

While some researchers analyze web searches and page views for their outbreak information, other investigators found it more fruitful to simply listen in on what users were saying.

Michael J. Paul, a PhD candidate at Johns Hopkins University, and colleagues used an influenza surveillance system based on Twitter data to create their models. The system is able to filter out media awareness campaigns and other confounders. The researchers combined its data with outbreak surveillance data from the CDC and compared the resulting model with a baseline historical prediction model. Prediction data also were collected from Google Flu Trends for additional analysis.

Forecasts were made during the 2011-2012, 2012-2013 and 2013-2014 influenza seasons (Nov. 27-April 5) using week-by-week observational and historic data. Comparisons were made to ILINet reports immediately after release and when the CDC released its more accurate estimates weeks later.

Researchers found that a model combining Twitter and historical data outperformed one that relied only on historical data. Using the Twitter model reduced nowcasting error by 29.6%, a reduction that dipped to 6.09% when the CDC’s final estimates were used. The Twitter model was also frequently more accurate than the baseline when forecasting outbreak estimates, with 10-week predictions that had fewer errors than those of the baseline model 4 weeks earlier.
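The general idea of comparing a history-only baseline with a model that also uses a real-time signal can be sketched in a few lines of Python. The example below is an assumption-laden toy, not the model from the study: the baseline simply carries forward last week's value, the augmented nowcast adds a least-squares adjustment based on the week-over-week change in a tweet rate, and every number is synthetic.

```python
def fit_slope(signal_changes, target_changes):
    """Least-squares slope through the origin: target ~ slope * signal."""
    denom = sum(x * x for x in signal_changes)
    if denom == 0:
        raise ValueError("signal never changes; slope is undefined")
    return sum(x * y for x, y in zip(signal_changes, target_changes)) / denom


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic weekly values, invented for illustration only.
ili = [1.0, 1.2, 1.5, 2.1, 2.8, 3.4, 3.1, 2.5, 1.9, 1.4, 1.1, 1.0]
tweets = [10.5, 11.6, 15.4, 20.7, 28.5, 33.6, 31.4, 24.8, 19.3, 13.7, 11.2, 9.8]
split = 8  # weeks 1-7 fit the slope; weeks 8-11 are held out

train_weeks = range(1, split)
slope = fit_slope(
    [tweets[t] - tweets[t - 1] for t in train_weeks],
    [ili[t] - ili[t - 1] for t in train_weeks],
)

test_weeks = range(split, len(ili))
actual = [ili[t] for t in test_weeks]
# Baseline: this week's level is assumed equal to last week's report.
baseline = [ili[t - 1] for t in test_weeks]
# Augmented: adjust last week's report by the change seen in the tweet rate.
augmented = [ili[t - 1] + slope * (tweets[t] - tweets[t - 1]) for t in test_weeks]

baseline_mae = mean_absolute_error(actual, baseline)
augmented_mae = mean_absolute_error(actual, augmented)
reduction = (baseline_mae - augmented_mae) / baseline_mae
print(f"baseline {baseline_mae:.3f}, augmented {augmented_mae:.3f}, cut {reduction:.1%}")
```

The size of the improvement here is an artifact of the clean synthetic signal. With real data, the gain depends on how noisy the signal is and on which official figures (initial or revised) serve as the reference, as the 29.6% and 6.09% results show.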

For current estimates, Google Flu Trends only reduced error over the baseline during a single influenza season, and it was outperformed when making future predictions. These results conflicted with Preis’ study, among others, although the researchers wrote that the inclusion of one season when Google Flu Trends performed much worse than usual could be an explanation. The algorithm used by the service also has been updated since the time of the study periods.

“There are several benefits to using Twitter over [Google Flu Trends], including the ubiquity, openness, public availability and ease of use of Twitter data,” the researchers wrote. “These factors have led the wider academic community to focus on Twitter, especially in light of recent poor performance of [Google Flu Trends], and the attendant concerns about using metrics based on proprietary data and algorithms. As we collect additional years of tweets, we will be able to make broader claims about the relative utility of Google and Twitter data.”

References:

Generous N. PLoS Comput Biol. 2014;doi:10.1371/journal.pcbi.1003892.
Paul MJ. PLoS Curr. 2014;doi:10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117.
Preis T. R Soc Open Sci. 2014;doi:10.1098/rsos.140095.

Disclosure: Paul reported support from a Microsoft Research PhD fellowship, and currently serves on the advisory board for Sickweather. One of his colleagues also reported receipt of compensation for talks and consultation from Directing Medicine, Progeny Systems and Sickweather. No other researchers report any relevant financial disclosures.