For more than three decades, the US has conducted annual Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture the health behaviors and health status of its population. Although such population-level information is important for local governments seeking to identify local needs, traditional datasets take several years to collate and become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. Because of the scale of the data, approximately two billion tweets in this work, aggregating tweets at the population level is common practice. While aggregation reduces the computational cost, it discards information about how the data are distributed over the population, and that information may be important for identifying the population's health behaviors and health outcomes. In this work, we propose statistical features constructed on top of primary features to predict county-level health indices. The primary features include topics and linguistic patterns extracted from tweets whose geotags have been resolved to counties. In addition, tweeting behaviors, particularly tweeting time, are used as predictors of the health indices. Apache Spark, a cluster computing framework, was employed to process the large corpus of tweets efficiently, including resolving geotags to counties, extracting the low-level (primary) features, and computing the statistical features. The results show strong correlations between publicly available health indices and the features extracted from geospatially coded Twitter data. The statistical features achieved higher correlation coefficients than the aggregated ones, suggesting that the proposed features are applicable to a wide range of population-level data analytics tasks. Prediction performance also improved when the temporal information was included.
This demonstrates that the real-time analysis of social media data can provide timely insights into the health of populations.
- Apache Spark
- Cluster computing
- Large-scale parallel and distributed implementation
- Mining spatial and temporal data
- Spatio-temporal features
- Statistical features
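
The contrast the abstract draws between aggregated and statistical features can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `county_features`, the `(county, user, score)` tuple layout, the choice of per-user grouping as the unit of "distribution over the population", and the particular statistics (mean, median, standard deviation) are all assumptions made for illustration. Plain Python is used here instead of Apache Spark to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def county_features(tweets):
    """Compare an aggregated feature with distribution-aware features.

    `tweets` is an iterable of (county, user, score) tuples, where `score`
    stands for a hypothetical primary feature value of one tweet
    (for example, the weight of one topic). All names are illustrative.
    """
    # Group the primary feature values by county, then by user.
    per_user = defaultdict(lambda: defaultdict(list))
    for county, user, score in tweets:
        per_user[county][user].append(score)

    features = {}
    for county, users in per_user.items():
        # Aggregated feature: pool every tweet in the county together.
        all_scores = [s for scores in users.values() for s in scores]
        # Statistical features: summarize how per-user values are spread.
        user_means = [mean(scores) for scores in users.values()]
        features[county] = {
            "aggregated_mean": mean(all_scores),
            "user_mean": mean(user_means),
            "user_median": median(user_means),
            "user_std": pstdev(user_means),
        }
    return features
```

In this sketch a single prolific user can dominate `aggregated_mean`, whereas the per-user statistics weight each user equally and also record the spread across users, which is the kind of distributional information that pooling discards.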