A framework for automated anomaly detection in high frequency water-quality data from in situ sensors

Catherine Leigh, Omar Alsibai, Rob J. Hyndman, Sevvandi Kandanaarachchi, Olivia C. King, James M. McGree, Catherine Neelamraju, Jennifer Strauss, Priyanga Dilini Talagala, Ryan D.R. Turner, Kerrie Mengersen, Erin E. Peterson

Research output: Contribution to journal / Article / Research / peer-review

4 Citations (Scopus)

Abstract

Monitoring the water quality of rivers is increasingly conducted using automated in situ sensors, enabling timelier identification of unexpected values or trends. However, the data are confounded by anomalies caused by technical issues, for which the volume and velocity of data preclude manual detection. We present a framework for automated anomaly detection in high-frequency water-quality data from in situ sensors, using turbidity, conductivity and river level data collected from rivers flowing into the Great Barrier Reef. After identifying end-user needs and defining anomalies, we ranked anomaly importance and selected suitable detection methods. High-priority anomalies included sudden isolated spikes and level shifts, most of which were classified correctly by regression-based methods such as autoregressive integrated moving average models. However, incorporation of multiple water-quality variables as covariates reduced performance due to complex relationships among variables. Classifications of drift and periods of anomalously low or high variability were more often correct when we applied mitigation, which replaces anomalous measurements with forecasts for further forecasting, but this inflated false positive rates. Feature-based methods also performed well on high-priority anomalies and were similarly less proficient at detecting lower-priority anomalies, resulting in high false negative rates. Unlike regression-based methods, however, all feature-based methods produced low false positive rates and have the benefit of not requiring training or optimization. Rule-based methods successfully detected a subset of lower-priority anomalies, specifically impossible values and missing observations. We therefore suggest that a combination of methods will provide optimal performance in terms of correct anomaly detection, whilst minimizing false detection rates. Furthermore, our framework emphasizes the importance of communication between end-users and anomaly detection developers for optimal outcomes with respect to both detection performance and end-user application. To this end, our framework has high transferability to other types of high-frequency time-series data and anomaly detection applications.
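
As a rough illustration of two ideas mentioned in the abstract, the Python sketch below shows (i) rule-based labelling of impossible values and missing observations and (ii) forecast-based flagging with optional mitigation. It is not the authors' implementation: a naive last-value forecast stands in for a fitted ARIMA model, and the function names, the `lower` bound and the fixed `threshold` are assumptions made only for this example.

```python
import math
from typing import List, Optional


def rule_based_flags(values: List[Optional[float]], lower: float = 0.0) -> List[str]:
    """Label each observation as 'missing', 'impossible' or 'ok'.

    Assumption: any value below `lower` is physically impossible
    (for example, negative turbidity); None or NaN marks a missing observation.
    """
    labels = []
    for v in values:
        if v is None or math.isnan(v):
            labels.append("missing")
        elif v < lower:
            labels.append("impossible")
        else:
            labels.append("ok")
    return labels


def forecast_flags(values: List[float], threshold: float, mitigate: bool) -> List[bool]:
    """Flag points whose one-step-ahead forecast error exceeds `threshold`.

    Assumption: the forecast is the last value in the working history,
    a naive stand-in for an ARIMA forecast. With mitigate=True, a flagged
    measurement is replaced by its forecast before the next forecast is made.
    """
    history = [values[0]]   # the first point has no forecast, so it is never flagged
    flags = [False]
    for v in values[1:]:
        forecast = history[-1]
        is_anomaly = abs(v - forecast) > threshold
        flags.append(is_anomaly)
        # Mitigation: carry the forecast forward instead of the anomalous value.
        history.append(forecast if (is_anomaly and mitigate) else v)
    return flags


if __name__ == "__main__":
    print(rule_based_flags([1.2, None, -3.0, float("nan"), 0.0]))
    series = [5.0, 5.1, 5.0, 9.0, 9.1, 9.0, 5.1]  # toy data with a temporary shift
    print(forecast_flags(series, threshold=1.0, mitigate=False))
    print(forecast_flags(series, threshold=1.0, mitigate=True))
```

On this toy series, the run without mitigation flags only the two jump points (indices 3 and 6), whereas the run with mitigation flags the whole shifted segment (indices 3 to 5). The same behaviour hints at why mitigation can inflate false positive rates: if the shift were genuine, every point in it would still be flagged.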

Original language: English
Pages (from-to): 885-898
Number of pages: 14
Journal: Science of the Total Environment
Volume: 664
DOIs: https://doi.org/10.1016/j.scitotenv.2019.02.085
Publication status: Published - 10 May 2019

Keywords

  • Big data
  • Forecasting
  • Near-real time
  • Quality control and assurance
  • River
  • Time series

Cite this

Leigh, Catherine ; Alsibai, Omar ; Hyndman, Rob J. ; Kandanaarachchi, Sevvandi ; King, Olivia C. ; McGree, James M. ; Neelamraju, Catherine ; Strauss, Jennifer ; Talagala, Priyanga Dilini ; Turner, Ryan D.R. ; Mengersen, Kerrie ; Peterson, Erin E. / A framework for automated anomaly detection in high frequency water-quality data from in situ sensors. In: Science of the Total Environment. 2019 ; Vol. 664. pp. 885-898.
@article{14b13d0b2abc4a24b10e9968cc9cf5d7,
title = "A framework for automated anomaly detection in high frequency water-quality data from in situ sensors",
keywords = "Big data, Forecasting, Near-real time, Quality control and assurance, River, Time series",
author = "Catherine Leigh and Omar Alsibai and Hyndman, {Rob J.} and Sevvandi Kandanaarachchi and King, {Olivia C.} and McGree, {James M.} and Catherine Neelamraju and Jennifer Strauss and Talagala, {Priyanga Dilini} and Turner, {Ryan D.R.} and Kerrie Mengersen and Peterson, {Erin E.}",
year = "2019",
month = "5",
day = "10",
doi = "10.1016/j.scitotenv.2019.02.085",
language = "English",
volume = "664",
pages = "885--898",
journal = "Science of the Total Environment",
issn = "0048-9697",
publisher = "Elsevier",

}

Leigh, C, Alsibai, O, Hyndman, RJ, Kandanaarachchi, S, King, OC, McGree, JM, Neelamraju, C, Strauss, J, Talagala, PD, Turner, RDR, Mengersen, K & Peterson, EE 2019, 'A framework for automated anomaly detection in high frequency water-quality data from in situ sensors', Science of the Total Environment, vol. 664, pp. 885-898. https://doi.org/10.1016/j.scitotenv.2019.02.085

