Twitter mining for fine-grained syndromic surveillance

Paola Velardi, Giovanni Stilo, Alberto E. Tozzi, Francesco Gesualdo

Research output: Contribution to journalArticle

33 Citations (Scopus)

Abstract

Background: Digital traces left on the Internet by web users, if properly aggregated and analyzed, can represent a huge information dataset able to inform syndromic surveillance systems in real time with data collected directly from individuals. Since people use everyday language rather than medical jargon (e.g. runny nose vs. respiratory distress), knowledge of patients' terminology is essential for the mining of health related conversations on social networks. Objectives: In this paper we present a methodology for early detection and analysis of epidemics based on mining Twitter messages. In order to reliably trace messages of patients that actually complain of a disease, first, we learn a model of naïve medical language, second, we adopt a symptom-driven, rather than disease-driven, keyword analysis. This approach represents a major innovation compared to previous published work in the field. Method: We first developed an algorithm to automatically learn a variety of expressions that people use to describe their health conditions, thus improving our ability to detect health-related "concepts" expressed in non-medical terms and, in the end, producing a larger body of evidence. We then implemented a Twitter monitoring instrument to finely analyze the presence and combinations of symptoms in tweets. Results: We first evaluate the algorithm's performance on an available dataset of diverse medical condition synonyms, then, we assess its utility in a case study of five common syndromes for surveillance purposes. We show that, by exploiting physicians' knowledge on symptoms positively or negatively related to a given disease, as well as the correspondence between patients' "naïve" terminology and medical jargon, not only can we analyze large volumes of Twitter messages related to that disease, but we can also mine micro-blogs with complex queries, performing fine-grained tweets classification (e.g. those reporting influenza-like illness (ILI) symptoms vs. common cold or allergy). Conclusions: Our approach yields a very high level of correlation with flu trends derived from traditional surveillance systems. Compared with Google Flu, another popular tool based on query search volumes, our method is more flexible and less sensitive to changes in web search behaviors.

Original languageEnglish
Pages (from-to)153-163
Number of pages11
JournalArtificial Intelligence in Medicine
Volume61
Issue number3
DOIs
Publication statusPublished - 2014

Fingerprint

Health
Terminology
Language
Blogging
Allergies
Common Cold
Aptitude
Blogs
Computer Systems
Nose
Social Support
Internet
Human Influenza
Hypersensitivity
Innovation
Physicians
Monitoring
Datasets

Keywords

  • Micro-blog mining
  • Patient's language learning
  • Syndromic surveillance
  • Terminology clustering
  • Twitter mining

ASJC Scopus subject areas

  • Artificial Intelligence
  • Medicine (miscellaneous)
  • Medicine(all)

Cite this

Twitter mining for fine-grained syndromic surveillance. / Velardi, Paola; Stilo, Giovanni; Tozzi, Alberto E.; Gesualdo, Francesco.

In: Artificial Intelligence in Medicine, Vol. 61, No. 3, 2014, p. 153-163.

Research output: Contribution to journalArticle

@article{dc6122e1ff1e4d9aba440278da55322f,
title = "Twitter mining for fine-grained syndromic surveillance",
abstract = "Background: Digital traces left on the Internet by web users, if properly aggregated and analyzed, can represent a huge information dataset able to inform syndromic surveillance systems in real time with data collected directly from individuals. Since people use everyday language rather than medical jargon (e.g. runny nose vs. respiratory distress), knowledge of patients' terminology is essential for the mining of health related conversations on social networks. Objectives: In this paper we present a methodology for early detection and analysis of epidemics based on mining Twitter messages. In order to reliably trace messages of patients that actually complain of a disease, first, we learn a model of na{\"i}ve medical language, second, we adopt a symptom-driven, rather than disease-driven, keyword analysis. This approach represents a major innovation compared to previous published work in the field. Method: We first developed an algorithm to automatically learn a variety of expressions that people use to describe their health conditions, thus improving our ability to detect health-related {"}concepts{"} expressed in non-medical terms and, in the end, producing a larger body of evidence. We then implemented a Twitter monitoring instrument to finely analyze the presence and combinations of symptoms in tweets. Results: We first evaluate the algorithm's performance on an available dataset of diverse medical condition synonyms, then, we assess its utility in a case study of five common syndromes for surveillance purposes. We show that, by exploiting physicians' knowledge on symptoms positively or negatively related to a given disease, as well as the correspondence between patients' {"}na{\"i}ve{"} terminology and medical jargon, not only can we analyze large volumes of Twitter messages related to that disease, but we can also mine micro-blogs with complex queries, performing fine-grained tweets classification (e.g. those reporting influenza-like illness (ILI) symptoms vs. common cold or allergy). Conclusions: Our approach yields a very high level of correlation with flu trends derived from traditional surveillance systems. Compared with Google Flu, another popular tool based on query search volumes, our method is more flexible and less sensitive to changes in web search behaviors.",
keywords = "Micro-blog mining, Patient's language learning, Syndromic surveillance, Terminology clustering, Twitter mining",
author = "Paola Velardi and Giovanni Stilo and Tozzi, {Alberto E.} and Francesco Gesualdo",
year = "2014",
doi = "10.1016/j.artmed.2014.01.002",
language = "English",
volume = "61",
pages = "153--163",
journal = "Artificial Intelligence in Medicine",
issn = "0933-3657",
publisher = "Elsevier",
number = "3",

}

TY - JOUR

T1 - Twitter mining for fine-grained syndromic surveillance

AU - Velardi, Paola

AU - Stilo, Giovanni

AU - Tozzi, Alberto E.

AU - Gesualdo, Francesco

PY - 2014

Y1 - 2014

N2 - Background: Digital traces left on the Internet by web users, if properly aggregated and analyzed, can represent a huge information dataset able to inform syndromic surveillance systems in real time with data collected directly from individuals. Since people use everyday language rather than medical jargon (e.g. runny nose vs. respiratory distress), knowledge of patients' terminology is essential for the mining of health related conversations on social networks. Objectives: In this paper we present a methodology for early detection and analysis of epidemics based on mining Twitter messages. In order to reliably trace messages of patients that actually complain of a disease, first, we learn a model of naïve medical language, second, we adopt a symptom-driven, rather than disease-driven, keyword analysis. This approach represents a major innovation compared to previous published work in the field. Method: We first developed an algorithm to automatically learn a variety of expressions that people use to describe their health conditions, thus improving our ability to detect health-related "concepts" expressed in non-medical terms and, in the end, producing a larger body of evidence. We then implemented a Twitter monitoring instrument to finely analyze the presence and combinations of symptoms in tweets. Results: We first evaluate the algorithm's performance on an available dataset of diverse medical condition synonyms, then, we assess its utility in a case study of five common syndromes for surveillance purposes. We show that, by exploiting physicians' knowledge on symptoms positively or negatively related to a given disease, as well as the correspondence between patients' "naïve" terminology and medical jargon, not only can we analyze large volumes of Twitter messages related to that disease, but we can also mine micro-blogs with complex queries, performing fine-grained tweets classification (e.g. those reporting influenza-like illness (ILI) symptoms vs. common cold or allergy). Conclusions: Our approach yields a very high level of correlation with flu trends derived from traditional surveillance systems. Compared with Google Flu, another popular tool based on query search volumes, our method is more flexible and less sensitive to changes in web search behaviors.

AB - Background: Digital traces left on the Internet by web users, if properly aggregated and analyzed, can represent a huge information dataset able to inform syndromic surveillance systems in real time with data collected directly from individuals. Since people use everyday language rather than medical jargon (e.g. runny nose vs. respiratory distress), knowledge of patients' terminology is essential for the mining of health related conversations on social networks. Objectives: In this paper we present a methodology for early detection and analysis of epidemics based on mining Twitter messages. In order to reliably trace messages of patients that actually complain of a disease, first, we learn a model of naïve medical language, second, we adopt a symptom-driven, rather than disease-driven, keyword analysis. This approach represents a major innovation compared to previous published work in the field. Method: We first developed an algorithm to automatically learn a variety of expressions that people use to describe their health conditions, thus improving our ability to detect health-related "concepts" expressed in non-medical terms and, in the end, producing a larger body of evidence. We then implemented a Twitter monitoring instrument to finely analyze the presence and combinations of symptoms in tweets. Results: We first evaluate the algorithm's performance on an available dataset of diverse medical condition synonyms, then, we assess its utility in a case study of five common syndromes for surveillance purposes. We show that, by exploiting physicians' knowledge on symptoms positively or negatively related to a given disease, as well as the correspondence between patients' "naïve" terminology and medical jargon, not only can we analyze large volumes of Twitter messages related to that disease, but we can also mine micro-blogs with complex queries, performing fine-grained tweets classification (e.g. those reporting influenza-like illness (ILI) symptoms vs. common cold or allergy). Conclusions: Our approach yields a very high level of correlation with flu trends derived from traditional surveillance systems. Compared with Google Flu, another popular tool based on query search volumes, our method is more flexible and less sensitive to changes in web search behaviors.

KW - Micro-blog mining

KW - Patient's language learning

KW - Syndromic surveillance

KW - Terminology clustering

KW - Twitter mining

UR - http://www.scopus.com/inward/record.url?scp=84904296459&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84904296459&partnerID=8YFLogxK

U2 - 10.1016/j.artmed.2014.01.002

DO - 10.1016/j.artmed.2014.01.002

M3 - Article

VL - 61

SP - 153

EP - 163

JO - Artificial Intelligence in Medicine

JF - Artificial Intelligence in Medicine

SN - 0933-3657

IS - 3

ER -