Prediction of population health indices from social media using kernel-based textual and temporal features

Thin Nguyen, Duc Thanh Nguyen, Mark E. Larsen, Bridianne O'Dea, John Yearwood, Dinh Phung, Svetha Venkatesh, Helen Christensen

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

8 Citations (Scopus)


Since 1984, the US has annually conducted the Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture the population's health behaviors, such as drinking or smoking, and health outcomes, including mental, physical, and generic health. Although this kind of information at a population level, such as US counties, is important for local governments to identify local needs, traditional datasets may take years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. In this work, novel textual and temporal features are proposed to predict the percentage of adults in a county reporting "insufficient sleep", a health behavior, and, at the same time, their health outcomes. The proposed textual features are defined at mid-level and can be applied on top of various low-level textual features. They are computed via kernel functions on underlying features and encode the relationships between individual underlying features over a population. To further improve prediction of the health indices, the textual features are augmented with temporal information. We evaluated the proposed features and compared them with existing features using a dataset collected from the BRFSS. Experimental results show that the combination of kernel-based textual features and temporal information predicts both the health behavior (with best performance at rho = 0.82) and health outcomes (with best performance at rho = 0.78) well, demonstrating the capability of social media data in the prediction of population health indices. The results also show that our proposed features achieved higher correlation coefficients than the existing ones, by up to 0.16, suggesting the potential of the approach in a wide spectrum of data analytics applications at population levels.

Original language: English
Title of host publication: WWW'17 Companion - Proceedings of the 26th International Conference on World Wide Web
Editors: Eugene Agichtein, Evgeniy Gabrilovich
Place of Publication: Geneva, Switzerland
Publisher: International World Wide Web Conferences Steering Committee
Number of pages: 9
ISBN (Electronic): 9781450349147
Publication status: Published - 2017
Externally published: Yes
Event: International World Wide Web Conference 2017 - Perth Convention and Exhibition Centre (PCEC), Perth, Australia
Duration: 3 Apr 2017 → 7 Apr 2017
Conference number: 26th (Proceedings)


Conference: International World Wide Web Conference 2017
Abbreviated title: WWW 2017


Keywords

  • Cognitive computing
  • Feature engineering
  • Geo-referenced tweets
  • Kernel-based features
  • Online texts
  • Population health indices
  • Prediction
  • Temporal information
  • Textual features
