This page contains resources to help other researchers use social media data for public health.



We post social media trends to [Link]. Check out this tutorial on how to use that site.



We publish open source software to support social media analysis.

Carmen [Java, Python]
Carmen is a library for geolocating tweets. Given a tweet, Carmen will return Location objects that represent a physical location. Carmen uses both coordinates and other information in a tweet to make geolocation decisions. It’s not perfect, but this greatly increases the number of geolocated tweets over what Twitter provides.

The Python and Java versions don’t give exactly the same results due to differences in the dependencies. Going forward, our development will focus on the Python version. If you use Carmen, please cite:
Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013.

Twitter Stream Downloader [Link]
This small software package provides code for automating the downloading of data from the Twitter streaming API.

Demographer: Gender Identification for Social Media [Link]
Demographer is a Python package that identifies demographic characteristics based on a name. It’s designed for Twitter, where it takes the name of the user and returns information about his or her likely demographics.


As part of our research we collect and annotate social media datasets.

Format: Each dataset is encoded in JSON format, with one JSON record per line. Each record contains the following fields: id (the tweet id) and label (a dictionary of annotations for this tweet, where each key is the name of an annotation and each value is its label). Each record also has either a text field (the text of the tweet) or a tweet field (the full tweet object from Twitter).
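
As an illustration of this format, here is a minimal Python sketch for reading a dataset file and tallying one annotation. It is not part of the released software. The helper names (read_records, tweet_text, label_counts) are hypothetical, and it assumes that a full tweet object stores its body under a "text" key, as in Twitter's classic tweet JSON.

```python
import json
from collections import Counter


def read_records(path):
    """Yield one dict per non-empty line of a one-record-per-line JSON file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def tweet_text(record):
    """Return the tweet text from the 'text' field or the full 'tweet' object."""
    if "text" in record:
        return record["text"]
    # Assumption: the full tweet object keeps its body under the 'text' key.
    return record["tweet"].get("text", "")


def label_counts(records, annotation):
    """Count the label values of one annotation name across records."""
    return Counter(
        record["label"][annotation]
        for record in records
        if annotation in record.get("label", {})
    )
```

For example, label_counts(read_records(path), name) returns a Counter of label values, where path is a downloaded dataset file and name is one of the annotation names that appear in that file's label dictionaries.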

Flu Vaccination Tweets [Link]

This dataset contains annotations for whether a tweet is relevant to the topic of flu vaccination and whether the author intends to receive a flu vaccine. Analysis of this dataset was published in:

Xiaolei Huang, Michael C. Smith, Michael Paul, Dmytro Ryzhkov, Sandra Quinn, David Broniatowski, Mark Dredze. Examining Patterns of Influenza Vaccination in Social Media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), 2017.

Vaccination Sentiment and Relevance Tweets [Link]

This dataset contains annotations for whether a tweet is relevant to the topic of vaccinations and whether the author is expressing a positive or negative view about vaccines. Analysis of this dataset was published in:

Michael Smith, David A. Broniatowski, Mark Dredze. Using Twitter to Examine Social Rationales for Vaccine Refusal. International Engineering Systems Symposium (CESUN), 2016.

Mark Dredze, David A. Broniatowski, Michael Smith, Karen M. Hilyard. Understanding Vaccine Refusal: Why We Need Social Media Now. American Journal of Preventive Medicine, 2015.

Zika Conspiracy Tweets [Link]

This dataset contains annotations for whether a tweet about Zika contains pseudo-scientific information. Analysis of this dataset was published in:

Mark Dredze, David A Broniatowski, Karen M Hilyard. Zika Vaccine Misconceptions: A social media analysis. Vaccine, 2016.