TY - JOUR
T1 - Using spatiotemporal distribution of geocoded Twitter data to predict US county-level health indices
AU - Nguyen, Thin
AU - Larsen, Mark
AU - O'Dea, Bridianne
AU - Nguyen, Hung
AU - Nguyen, Duc Thanh
AU - Yearwood, John
AU - Phung, Dinh
AU - Venkatesh, Svetha
AU - Christensen, Helen
PY - 2020/9
Y1 - 2020/9
N2 - For more than three decades, the US has annually conducted Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture health behavior and health status of its people. Though this kind of information at population level is important for local governments to identify local needs, traditional datasets take several years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. Due to the large scale of data, such as approximately two billions of tweets in this work, aggregating the tweets at a population level is common practice. While alleviating the computational cost, the aggregation operation would result in the loss of information on the distribution of data over the population, and such information may be important for identifying the health behavior and health outcomes of the population. In this work, we propose statistical features constructed on-top of primary features to predict county-level health indices. The primary features include topics and linguistic patterns extracted from tweets with county-decoded information. In addition, tweeting behaviors, particularly tweeting time, are used as a predictor of the health indices. Apache Spark, an advanced cluster computing paradigm, was employed to efficiently process the large corpus of tweets, including geo-decoding the geotags, extracting low-level (primary) features, and computing the statistical features. The results show strong correlations between publicly available health indices and the features extracted from geospatially coded Twitter data. Statistical features gained higher correlation coefficients than did the aggregation ones, suggesting the potential and applicability of the proposed features in a wide spectrum of applications on data analytics at population levels. In addition, the prediction performance was also improved when the temporal information was employed. This demonstrates that the real-time analysis of social media data can provide timely insights into the health of populations.
AB - For more than three decades, the US has annually conducted Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture health behavior and health status of its people. Though this kind of information at population level is important for local governments to identify local needs, traditional datasets take several years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. Due to the large scale of data, such as approximately two billions of tweets in this work, aggregating the tweets at a population level is common practice. While alleviating the computational cost, the aggregation operation would result in the loss of information on the distribution of data over the population, and such information may be important for identifying the health behavior and health outcomes of the population. In this work, we propose statistical features constructed on-top of primary features to predict county-level health indices. The primary features include topics and linguistic patterns extracted from tweets with county-decoded information. In addition, tweeting behaviors, particularly tweeting time, are used as a predictor of the health indices. Apache Spark, an advanced cluster computing paradigm, was employed to efficiently process the large corpus of tweets, including geo-decoding the geotags, extracting low-level (primary) features, and computing the statistical features. The results show strong correlations between publicly available health indices and the features extracted from geospatially coded Twitter data. Statistical features gained higher correlation coefficients than did the aggregation ones, suggesting the potential and applicability of the proposed features in a wide spectrum of applications on data analytics at population levels. In addition, the prediction performance was also improved when the temporal information was employed. This demonstrates that the real-time analysis of social media data can provide timely insights into the health of populations.
KW - Apache Spark
KW - Cluster computing
KW - Large-scale parallel and distributed implementation
KW - Mining spatial and temporal data
KW - Spatio-temporal features
KW - Statistical features
UR - http://www.scopus.com/inward/record.url?scp=85042289313&partnerID=8YFLogxK
U2 - 10.1016/j.future.2018.01.014
DO - 10.1016/j.future.2018.01.014
M3 - Article
AN - SCOPUS:85042289313
SN - 0167-739X
VL - 110
SP - 620
EP - 628
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -