A new tidy data structure to support exploration and modeling of temporal data

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Mining temporal data for information is often inhibited by a multitude of formats: regular or irregular time intervals, point events that need aggregating, multiple observational units or repeated measurements on multiple individuals, and heterogeneous data types. This work presents a cohesive and conceptual framework for organizing and manipulating temporal data, which in turn flows into visualization, modeling, and forecasting routines. Tidy data principles are extended to temporal data by: (1) mapping the semantics of a dataset into its physical layout; (2) including an explicitly declared “index” variable representing time; (3) incorporating a “key” comprising single or multiple variables to uniquely identify units over time. This tidy data representation most naturally supports thinking of operations on the data as building blocks, forming part of a “data pipeline” in time-based contexts. A sound data pipeline facilitates a fluent workflow for analyzing temporal data. The infrastructure of tidy temporal data has been implemented in the R package tsibble. Supplementary materials for this article are available online.
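
The role of the “index” and “key” can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it is not the tsibble API: the function names (find_duplicates, as_temporal_table), the column names, and the sample readings are all hypothetical, chosen only for illustration. The sketch assumes that a valid table is one in which the key variables together with the index identify each row uniquely, and that rows are conveniently ordered by key and then by time.

```python
from collections import Counter
from datetime import date


def find_duplicates(rows, key_vars, index_var):
    """Return the (key values, index value) pairs that occur more than once."""
    counts = Counter(
        (tuple(row[k] for k in key_vars), row[index_var]) for row in rows
    )
    return [pair for pair, n in counts.items() if n > 1]


def as_temporal_table(rows, key_vars, index_var):
    """Declare an index and a key for `rows`, rejecting duplicated observations."""
    duplicates = find_duplicates(rows, key_vars, index_var)
    if duplicates:
        raise ValueError(
            f"key and index do not uniquely identify rows: {duplicates}"
        )
    # Sort by key, then by time, so each unit's series is contiguous and ordered.
    ordered = sorted(
        rows, key=lambda r: (tuple(r[k] for k in key_vars), r[index_var])
    )
    return {"key": tuple(key_vars), "index": index_var, "rows": ordered}


# Hypothetical data: two sensors, each measured on two days.
readings = [
    {"sensor": "B", "day": date(2020, 1, 1), "count": 7},
    {"sensor": "A", "day": date(2020, 1, 2), "count": 5},
    {"sensor": "A", "day": date(2020, 1, 1), "count": 3},
    {"sensor": "B", "day": date(2020, 1, 2), "count": 9},
]

table = as_temporal_table(readings, key_vars=["sensor"], index_var="day")
```

Declaring the index and key up front means that a duplicated observation is caught when the table is built, before it reaches any later step of a pipeline such as aggregation or modeling.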

Original language: English
Number of pages: 13
Journal: Journal of Computational and Graphical Statistics
Publication status: Accepted/In press - 2020

Keywords

  • Data pipelines
  • Data science
  • Data wrangling
  • Forecasting
  • Longitudinal data
  • Time series
