TY - JOUR
T1 - Metric selection and anomaly detection for cloud operations using log and metric correlation analysis
AU - Farshchi, Mostafa
AU - Schneider, Jean-Guy
AU - Weber, Ingo
AU - Grundy, John
PY - 2018
Y1 - 2018/03//
N2 - Cloud computing systems provide the facilities to make application services resilient against failures of individual computing resources. However, resiliency is typically limited by a cloud consumer's use and operation of cloud resources. In particular, system operations have been reported as one of the leading causes of system-wide outages. This applies specifically to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration, which are executed at much higher frequencies now than a decade ago. We address this problem by proposing a novel approach to detect errors in the execution of these kinds of operations, in particular for rolling upgrade operations. Our regression-based approach leverages the correlation between operations' activity logs and the effect of operation activities on cloud resources. First, we present a metric selection approach based on regression analysis. Second, the output of a regression model of selected metrics is used to derive assertion specifications, which can be used for runtime verification of running operations. We have conducted a set of experiments with different configurations of an upgrade operation on Amazon Web Services, with and without randomly injected faults, to demonstrate the utility of our new approach.
AB - Cloud computing systems provide the facilities to make application services resilient against failures of individual computing resources. However, resiliency is typically limited by a cloud consumer's use and operation of cloud resources. In particular, system operations have been reported as one of the leading causes of system-wide outages. This applies specifically to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration, which are executed at much higher frequencies now than a decade ago. We address this problem by proposing a novel approach to detect errors in the execution of these kinds of operations, in particular for rolling upgrade operations. Our regression-based approach leverages the correlation between operations' activity logs and the effect of operation activities on cloud resources. First, we present a metric selection approach based on regression analysis. Second, the output of a regression model of selected metrics is used to derive assertion specifications, which can be used for runtime verification of running operations. We have conducted a set of experiments with different configurations of an upgrade operation on Amazon Web Services, with and without randomly injected faults, to demonstrate the utility of our new approach.
KW - Anomaly detection
KW - Cloud application operations
KW - Cloud monitoring
KW - Error detection
KW - Log analysis
KW - Metric selection
UR - http://www.scopus.com/inward/record.url?scp=85016586904&partnerID=8YFLogxK
U2 - 10.1016/j.jss.2017.03.012
DO - 10.1016/j.jss.2017.03.012
M3 - Article
AN - SCOPUS:85016586904
SP - 531
EP - 549
JO - Journal of Systems and Software
JF - Journal of Systems and Software
SN - 0164-1212
ER -