Predicting Flu Outbreaks with Twitter

It's that time of year again: the flu is alive and well. While health authorities know that flu activity peaks from December through March, they often don't know where it will strike geographically, or when it will be most burdensome. Not knowing the answers to these questions can overburden hospitals, catch authorities unprepared, and leave patients untreated.

To monitor flu surges, health professionals traditionally rely on physicians reporting the number of flu-related hospitalizations and out-patient visits. Such a system captures only a limited sample of those infected and, because of reporting delays, often takes 1-2 weeks to detect a flu surge.

However, in recent years, health researchers have turned to data science techniques and Twitter to identify flu trends as they're happening. Access to real-time temporal and geographic flu trend data allows authorities to efficiently allocate resources where, and when, they're needed most. In this post, we'll explore a typical pipeline for identifying temporal and geographic flu trends using Twitter data.

3D representation of Influenza Virion, reproduced from CDC Public Health Image Library; CDC/Doug Jordan, M.A., Image ID #11881

Finding Tweets about Legitimate Flu Symptoms

Much like predicting a Yelp review's rating from its text alone, health researchers can predict whether or not a tweet refers to a legitimate flu infection. Aggregating the number of legitimate flu infection tweets by week or month allows researchers to evaluate temporal flu trends. With geographic information also attached to each tweet, researchers can track the flu's spatial extent. But how, specifically, do researchers identify flu-related tweets in the first place? A typical pipeline for identifying flu-related tweets is as follows:

1.        Identify health-related tweets:

First, researchers need to figure out which tweets refer to health-related phenomena. A simple text filter for tweets containing health-related words can be misleading, though. Phrases such as "I'm sick of this" can confuse a classifier, so Paul and Dredze (2014) advocate first training a binary classifier to distinguish between health-related and spurious tweets.

To build a training dataset for their 2014 study, Paul and Dredze had 5,128 tweets labeled as health-related or not using Amazon's crowdsourcing service, Mechanical Turk. Workers labeled tweets as either positive (health-related), negative (not health-related), or ambiguous.

Then, dropping "ambiguous" tweets and extraneous information like hashtags and usernames, Paul and Dredze split tweets up into n-grams of size 1, 2, and 3. N-grams are sequences of adjacent words, grouped singly (unigrams), in pairs (bigrams), or in threes (trigrams). For instance, using Python's Natural Language Toolkit, nltk, you can split up the tweet "I am sick of this" into a list of single words (unigrams):

>>> import nltk
>>> words = nltk.word_tokenize("I am sick of this")
>>> print(words)
['I', 'am', 'sick', 'of', 'this']

We can also produce bigrams and trigrams for the same tweet:

>>> bigrams = nltk.bigrams(words)
>>> trigrams = nltk.trigrams(words)
>>> print([i for i in bigrams])
[('I', 'am'), ('am', 'sick'), ('sick', 'of'), ('of', 'this')]
>>> print([i for i in trigrams])
[('I', 'am', 'sick'), ('am', 'sick', 'of'), ('sick', 'of', 'this')]

In order to form features for training their binary classifier, Paul and Dredze counted the number of times each n-gram occurred in a tweet, under the assumption that word sequences in health-related tweets would differ from those in non-health-related tweets. Note that you can perform the entire n-gram creation process, and produce n-gram counts for a corpus of tweets, with Python's sklearn package as follows. Using sklearn to produce your feature vector makes it easier to train the binary classifier in sklearn later on, because you won't need to convert the feature vector to a new format. For this example, let's add the tweet "I am sick with the flu", so that we have both a legitimate health-related tweet and a non-health-related tweet in our corpus.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(ngram_range=(1,3), token_pattern=r'(\w+)')
>>> # Produce Feature Vector:
>>> X = vectorizer.fit_transform(["I am sick of this", "I am sick with the flu"])

Note that CountVectorizer transforms each sentence into an array of counts (the number of times each n-gram is used within a tweet):

>>> print(vectorizer.get_feature_names())
[u'am', u'am sick', u'am sick of', u'am sick with', u'flu', u'i', u'i am', u'i am sick', u'of', u'of this', u'sick', u'sick of', u'sick of this', u'sick with', u'sick with the', u'the', u'the flu', u'this', u'with', u'with the', u'with the flu']
>>> print(X.toarray())
[[1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0]
 [1 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1]]

Thus, because the tweet "I am sick with the flu" contains one "am" and one "am sick", but no "am sick of", the first three entries in its count array are [1,1,0]. Training a classifier to identify further health-related tweets in an unlabeled corpus can also easily be done within sklearn. Paul and Dredze used a binary logistic regression model to classify health-related tweets for their 2014 study, so let's do the same:

>>> from sklearn.linear_model import LogisticRegression
>>> Y = ["Negative", "Positive"] # Only the second tweet in this dummy dataset is health-related
>>> model = LogisticRegression()
>>> model.fit(X, Y)

Paul and Dredze tuned the prediction threshold using 10-fold cross validation, resulting in 68% precision and 72% recall. Applying the classifier to over 4 billion tweets in their dataset, they identified 144 million health-related tweets.
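
Paul and Dredze's exact tuning procedure isn't reproduced here, but as a minimal sketch, tuning a prediction threshold with 10-fold cross validation in sklearn might look like the following (the features and labels below are random placeholders standing in for the real n-gram counts and Mechanical Turk labels):

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_predict
>>> from sklearn.metrics import precision_score, recall_score
>>> # Random placeholder data, standing in for the labeled corpus:
>>> rng = np.random.RandomState(0)
>>> X_train = rng.rand(100, 5)
>>> y_train = rng.randint(0, 2, 100)
>>> # Out-of-fold predicted probabilities from 10-fold cross validation:
>>> probs = cross_val_predict(LogisticRegression(), X_train, y_train,
...                           cv=10, method='predict_proba')[:, 1]
>>> # Sweep candidate thresholds and inspect the precision/recall trade-off:
>>> for t in (0.4, 0.5, 0.6):
...     preds = (probs >= t).astype(int)
...     print(t, precision_score(y_train, preds), recall_score(y_train, preds))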

2.        Predict flu infections in the identified health-related tweets using topic models:

Paul and Dredze (2014) removed punctuation, stop words, URLs and hashtags from the 144 million identified health-related tweets. They then trained unsupervised Latent Dirichlet Allocation (LDA) topic models on the tweets to identify health "topics" that underlie the unigrams (words) in each tweet. In LDA, a tweet's words are assumed to have been drawn at certain probabilities from mixtures of topics (probability distributions over words). So, if a tweet is about the flu, we would expect that tweet's words to come from a general mixture of topics corresponding to the flu (for instance, symptoms of the flu). See Edwin Chen's excellent introduction to LDA for more details and examples of the method. Note that there are many approaches to topic modeling in addition to LDA, each of which has varying success depending on the dataset.

In Python, you can use the gensim package to arrive at 20 topics from a corpus of tweets with LDA:

>>> from gensim import matutils, models
>>> # Convert the sklearn feature matrix into a gensim-compatible corpus.
>>> # documents_columns=False because sklearn stores documents as rows:
>>> corpus = matutils.Sparse2Corpus(X, documents_columns=False)
>>> # Map sklearn's column indices back to words so topics print readably:
>>> id2word = dict((v, k) for k, v in vectorizer.vocabulary_.items())
>>> lda = models.LdaModel(corpus, id2word=id2word, num_topics=20)
>>> lda.print_topics(20)

Paul and Dredze's resulting topics are too long to comfortably print here (you can see the full set of features and weights in their article data). However, their first topic starts off in the following way, featuring the probability of each word given topic 0:

topic #0: 0.045*throat + 0.026*ear + 0.022*eyes + ...

Two people manually annotated Paul and Dredze's resulting topics with the names of the illnesses most related to the words in each. The annotators agreed on seventeen topics, one of which was the flu. Check out the words most associated with the flu topic, sized by the probability of each word within the topic probability distribution:

Source: atam.topwords.csv, from Paul and Dredze 2014; word cloud created using Wordle.

Based on the prevalence of high-probability words in a tweet, it is simple to determine whether a novel tweet's topic probability distribution corresponds to the flu. For instance, gensim allows you to easily calculate the probability that a tweet belongs to the "flu" topic and so predict whether the tweet is about the flu. Simply represent the tweet as a "bag of words" (a vector of each word and its count within the tweet) and print the resulting topic probability distribution for the tweet:

>>> # Vectorize a new tweet with the same vocabulary, then convert it to a
>>> # gensim bag of words (a list of (term id, count) pairs):
>>> new_X = vectorizer.transform(["I think I have the flu"])
>>> bag_of_words = list(matutils.Sparse2Corpus(new_X, documents_columns=False))[0]
>>> print(lda[bag_of_words])

3.        Identify legitimate flu topic tweets referring specifically to personal illness:

Not all tweets refer to the person who is tweeting, however. A Twitter user might comment generally about the flu ("I hope I don't get the flu!"), or refer to public figures with the flu ("Donald Trump looks like he has the flu"). To this effect, Lamb et al. (2013) use a log-linear model to filter flu-related tweets even further, based on semantic features like pronoun use and verb tense, as sketched below.
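
Lamb et al.'s actual feature templates are more sophisticated than this, but as a rough, hypothetical sketch, self-report cues like the following could be computed per tweet and fed to a log-linear classifier (for example, sklearn's LogisticRegression) alongside the n-gram counts:

>>> import re
>>> # Crude, hypothetical stand-ins for the kinds of semantic cues Lamb et
>>> # al. describe; not their published feature set:
>>> def self_report_features(tweet):
...     words = tweet.lower().split()
...     return {
...         'first_person': int(any(w in ('i', "i'm", 'me', 'my') for w in words)),
...         'third_person': int(any(w in ('he', 'she', 'they') for w in words)),
...         'non_assertive': int(bool(re.search(r'\b(will|hope|might)\b', tweet.lower()))),
...     }
...
>>> self_report_features("I hope I don't get the flu!")
{'first_person': 1, 'third_person': 0, 'non_assertive': 1}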

4.        Geolocate tweets with the Carmen geolocation system:

Only a very small percentage of tweets have GPS coordinates attached to them, but geographic location detection can be improved with user-supplied profile information in the form of city, county, state, and/or country. Paul and Dredze used Carmen to assign locations to recorded flu tweets based on all of these factors (2013).

So, for instance, using a sample tweet in the standard Twitter API JSON format, you can use Carmen to narrow in on a location when there are no geographic coordinates attached to the tweet.

>>> import json
>>> import carmen

>>> with open('sample_tweet.json') as json_data:
...     sample_tweet = json.load(json_data)

>>> resolver = carmen.get_resolver()
>>> resolver.load_locations()
>>> location = resolver.resolve_tweet(sample_tweet)
>>> print(location)
(False, Location(country=u'United States', state=u'Texas', known=True, id=3067))

In the case of the sample tweet, the user's location is set to "Texas" in the profile information, so Carmen returns "Texas" as the user's geographic location. If more specific city information were included, Carmen would also pick this up and provide more fine-tuned results.

A Summary of the Legitimate Flu-Related Tweet Identification Pipeline:

In summary, the process by which legitimate flu-related tweets may be identified is as follows:

1.        Identify health-related tweets.

2.        Predict flu infections in the identified health-related tweets using topic models.

3.        Identify legitimate flu topic tweets referring specifically to personal illness.

4.        Geolocate tweets with the Carmen geolocation system.

Once you've gone through this process and all that remains are legitimate flu-related tweets with geospatial data, you're finally ready to start identifying flu surges in space and time. Identifying surges simply requires you to aggregate tweets by the time they were tweeted (automatically included with each tweet) and their Carmen-identified geographic location.
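
As a minimal sketch, assuming a hypothetical pandas DataFrame of classified flu tweets, that aggregation might look like this:

>>> import pandas as pd
>>> # Hypothetical classified tweets: 'created_at' ships with every tweet,
>>> # 'state' comes from Carmen:
>>> tweets = pd.DataFrame({
...     'created_at': pd.to_datetime(['2013-01-07', '2013-01-08', '2013-01-15']),
...     'state': ['Texas', 'Texas', 'New York'],
... })
>>> # Count flu-related tweets per state per week:
>>> print(tweets.set_index('created_at').groupby('state').resample('W').size())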

Identifying Temporal and Geographic Surges in Flu Symptoms

For instance, Broniatowski et al. aggregated flu-related tweets by week and estimated temporal surges in influenza prevalence on a week-by-week basis (2013). Normally, influenza prevalence is measured by the CDC using outpatient visits as a proxy. Broniatowski et al., however, found that the number of flu-related tweets (identified using the infection detection algorithm described in the previous section) per 1000 tweets matched up well with CDC-reported trends from September 2012 through June 2013. You can see the close match between the two weekly measures of flu prevalence in the graph below:

[Figure: weekly flu-related tweets per 1,000 tweets alongside CDC-reported influenza prevalence, September 2012 through June 2013.]
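
To quantify how closely two such weekly series track each other, a simple correlation works; here is a minimal sketch with made-up numbers (not Broniatowski et al.'s data):

>>> from scipy.stats import pearsonr
>>> # Hypothetical weekly series: flu tweets per 1,000 tweets, and CDC-
>>> # reported influenza-like-illness rates for the same weeks:
>>> tweet_rate = [1.2, 1.8, 2.9, 4.1, 3.5, 2.2]
>>> cdc_ili = [1.0, 1.6, 2.7, 4.3, 3.8, 2.0]
>>> r, p = pearsonr(tweet_rate, cdc_ili)
>>> print(round(r, 2))
0.99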

Their results hint that, at least temporally, a combination of Twitter and a good infection detection algorithm can successfully identify flu prevalence in real time. But what about geography? Can we identify specifically where flu surges occur?

Unfortunately, at this point, there is no geographic gold-standard data against which to easily compare Twitter-derived geographic flu trends and assess how well they perform. That being said, Nagar et al. (2014) aggregated flu-related tweets for New York City by week from October 2012 to May 2013. They produced maps of flu-related tweets by week, further decomposed by the probability that each entry belonged to the "flu" topic. Then, just like a meteorological wind map, they indicated spatial flu trends for the week with an overall vector map pointing to locations with concentrated flu incidents. Check out their resulting maps for a week in their study time period:

[Figure: Nagar et al.'s weekly map of flu-related tweets in New York City, with vectors pointing toward locations of concentrated flu activity.]
While they could not evaluate their results against a known standard, Nagar et al. argued that tweet vector maps may, at the very least, be useful visualizations of city-level flu trends, giving health professionals the ability to pinpoint flu surge locations in real-time and prepare nearby hospitals for the burden.

Final Thoughts

In summary, data-driven methods for predicting spatial and temporal flu trends produce encouraging results and provide a faster way to identify flu surges. Notably, the Twitter-based methods mirror advances in marketing for identifying geospatial trends in brand image, as well as in urban planning for analyzing public attitudes towards various spaces and landmarks.

Twitter-based data science is not a cure-all for health professionals yet, however. While using Twitter as a data source is fast and appears to produce results comparable to manual methods, self-reporting on medical issues can lead to inaccuracies. For instance, lay-people often cannot tell the difference between the flu and other illnesses with similar symptoms, such as the common cold.