Using social media data to help predict, rather than just report on, diseases such as the Norovirus is something the Food Standards Agency (FSA) can now do at a 70% accuracy level.
Most researchers working on health and disease topics rely on information they receive from labs, hospitals and GPs. For example, collating lab reports of the number of people catching a virus at a given time gives a good understanding of when outbreaks occur, but only after they’ve happened. Useful? Absolutely. But how can you wrangle data to help you predict, rather than just report on, outbreaks of disease in order to inform the wider public?
How Twitter & Lab reports give a different view
The Food Standards Agency started looking into the true potential of different data streams around people’s illnesses in 2014. After testing a variety of methods and data sources, the team identified Twitter as a key platform for sharing spur-of-the-moment thoughts and feelings, like feeling unwell. Zooming in on the highly contagious (and equally notorious) Norovirus, the Agency could soon identify the words users shared to describe the disease, such as ‘puke’ and ‘vomit’, the more scenic descriptions ‘chunder’ and ‘vomcano’, and other symptoms like ‘diarrhoea’.
Having monitored this manually at first, the team decided to start using Pulsar as one of the tools to keep track of online activity around the disease, because of the platform’s flexibility across the range of needs expressed by the FSA and its fundamentally real-time approach. Sian Thomas, Head of Information Management at the FSA, explains:
“The problem with lab reports is that they only give you the ‘hindsight’ view. They are useful, but you are working with old information which obviously doesn’t help much in terms of predicting to a certain level outbreaks of the Norovirus. Plus, it’s often elderly people who go to the GP with these types of diseases. Younger people are more likely to deal with it on their own terms and might not even visit the GP. So in that sense, the lab reports give a skewed view. It’s just very hard to say how skewed.”
That’s where Twitter could make the difference: social media data has allowed the Agency to be more confident in its pre-existing data, the lab reports. The Agency’s analysts created an algorithm using historical data from Twitter, and started comparing the volume of mentions to lab reports of confirmed Norovirus cases in the UK. Combining the two data sets could give them a more complete picture of the topic. Plus, the fact that there is a correlation between the two indicates that Norovirus cases are likely spread fairly evenly across age groups, meaning the unrepresentativeness of the datasets is actually unimportant.
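As a minimal sketch of the comparison described above, the Python below computes a Pearson correlation between weekly mention counts and weekly confirmed lab reports, optionally shifting the lab series by a number of weeks. The FSA's actual algorithm is not described in this article, so the function names, the choice of Pearson correlation and the weekly figures are all illustrative assumptions, not the Agency's method or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(mentions, lab_reports, lag_weeks=0):
    """Correlate mentions in week t with lab reports in week t + lag_weeks."""
    if lag_weeks < 0:
        raise ValueError("lag_weeks must be zero or positive")
    if lag_weeks == 0:
        return pearson(mentions, lab_reports)
    # Drop the last weeks of mentions and the first weeks of lab reports
    # so that each mention count lines up with a later lab count.
    return pearson(mentions[:-lag_weeks], lab_reports[lag_weeks:])


# Hypothetical weekly counts, invented purely for illustration.
weekly_mentions = [10, 20, 40, 80, 40, 20, 10, 5]
weekly_lab_reports = [3, 5, 10, 20, 40, 20, 10, 5]

same_week = lagged_correlation(weekly_mentions, weekly_lab_reports, 0)
one_week_later = lagged_correlation(weekly_mentions, weekly_lab_reports, 1)
print(f"same week: {same_week:.2f}, lab reports one week later: {one_week_later:.2f}")
```

With these made-up numbers the correlation is higher when the lab reports trail the mentions by one week than when both are taken from the same week, which is the kind of pattern that would let mentions act as an early signal.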
In November 2015, the FSA Analytics team won the inaugural cross-government Data Science competition for the algorithm they created for their Norovirus research. The team have been doing this for over a year now, and the overlap and timing turned out to be even better than they had hoped for. Using their custom algorithm, the FSA can now predict outbreaks of the Norovirus at a 70% accuracy rate, which is incredibly high.
In the case of the FSA, the volumes of mentions are the proof they need to make their prediction. However, the dataset can give many more insights into the context of a common disease like the Norovirus. For example, looking at what is said, rather than how often things are said, paints a clear picture of how people talk about the virus.
Using Pulsar’s word cluster visualisation (below), we see which words are mentioned together with the flu. In this case, it becomes visible that people mention natural remedies like garlic soup and cloves & red onion rather than names of pharmaceutical cures. This may be because pharmaceutical solutions are not well known or not readily available, or perhaps because people accept that this particular flu is one to ‘sit out’ and that there is no quick fix.
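The idea behind a word cluster can be sketched as a simple co-occurrence count: for every post that mentions a target word, tally the other words that appear alongside it. This is only an illustration of the general technique, not a description of how Pulsar builds its visualisation; the function name, the stop-word list and the sample posts are hypothetical.

```python
import re
from collections import Counter


def cooccurring_words(posts, target, stopwords=frozenset()):
    """Count the words that appear in the same post as `target`.

    Each word is counted at most once per post, so the result is the
    number of posts in which the word and the target appear together.
    """
    counts = Counter()
    for post in posts:
        words = set(re.findall(r"[a-z']+", post.lower()))
        if target in words:
            counts.update(words - {target} - stopwords)
    return counts


# Hypothetical posts, invented purely for illustration.
sample_posts = [
    "Down with the flu, living on garlic soup",
    "Flu again. Garlic and red onion it is",
    "Lovely weather today",
]
common_words = frozenset({"the", "with", "on", "and", "it", "is"})

flu_neighbours = cooccurring_words(sample_posts, "flu", common_words)
print(flu_neighbours.most_common(3))
```

Ranking the resulting counts surfaces the terms most often used next to the target word, which is the raw material a cluster view would then arrange visually.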
Insights like these can quickly and effectively show those who want to reach audiences like this which topics are important to patients, who is leading the debate, and which questions are left unanswered.
How data science can help health care
Thanks to its approach to different types of data, the FSA can now use Tweets to communicate early warnings about the Norovirus to the general public. (Social) data science can definitely make a difference in health care, as Sian Thomas and the team can attest:
“More people turn to the internet than ever to find information about disease areas, and they are open about sharing information too. Mining this information in the right way can help us raise awareness with the general public – which in turn can result in disease prevention. There is huge scope for using social data this way, we’ve only scratched the surface. I’m excited about the opportunities it offers for researchers, health care professionals, and ultimately, the wider public.”