Metric selection and anomaly detection for cloud operations using log and metric correlation analysis

Mostafa Farshchi, Jean Guy Schneider, Ingo Weber, John Grundy

    Research output: Contribution to journal, Article, Research, peer-review

    25 Citations (Scopus)


    Cloud computing systems provide the facilities to make application services resilient against failures of individual computing resources. However, resiliency is typically limited by a cloud consumer's use and operation of cloud resources. In particular, system operations have been reported as one of the leading causes of system-wide outages. This applies specifically to DevOps operations such as backup, redeployment, upgrade, customized scaling, and migration, which are executed at much higher frequencies now than a decade ago. We address this problem by proposing a novel approach to detect errors in the execution of these kinds of operations, in particular rolling upgrade operations. Our regression-based approach leverages the correlation between operations' activity logs and the effect of operation activities on cloud resources. First, we present a metric selection approach based on regression analysis. Second, the output of a regression model of the selected metrics is used to derive assertion specifications, which can be used for runtime verification of running operations. To demonstrate the utility of our approach, we have conducted a set of experiments with different configurations of an upgrade operation on Amazon Web Services, with and without randomly injected faults.
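
The two steps named in the abstract (regression-based metric selection, then assertions derived from the fitted model) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-predictor ordinary least squares fit, an R-squared cutoff of 0.8, and a fixed tolerance, none of which come from the abstract. The metric names, the sample data, and the helper functions `fit_line`, `select_metrics`, and `make_assertion` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r2)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return a, b, r2


def select_metrics(event_counts, metrics, min_r2=0.8):
    """Keep only metrics that the log-derived event counts explain well.

    min_r2 is an assumed cutoff, chosen here for illustration only.
    """
    selected = {}
    for name, values in metrics.items():
        a, b, r2 = fit_line(event_counts, values)
        if r2 >= min_r2:
            selected[name] = (a, b)
    return selected


def make_assertion(a, b, tolerance):
    """Build a runtime check: observed value must be near the prediction."""
    def check(event_count, observed):
        return abs(observed - (a + b * event_count)) <= tolerance
    return check


# Hypothetical data: cumulative "instance terminated" log events per
# time window, and two monitored metrics sampled in the same windows.
terminations = [0, 1, 2, 3, 4, 5]
metrics = {
    "in_service_instances": [8, 7, 6, 5, 4, 3],  # tracks the log events
    "network_in_noise": [5, 1, 4, 2, 6, 3],      # unrelated to them
}

selected = select_metrics(terminations, metrics)
a, b = selected["in_service_instances"]
check = make_assertion(a, b, tolerance=0.5)
```

In this sketch, `check(2, 6)` passes because six in-service instances match the prediction after two logged terminations, while `check(2, 8)` fails and would be flagged as a possible anomaly in the running operation.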

    Original language: English
    Pages (from-to): 531-549
    Number of pages: 19
    Journal: Journal of Systems and Software
    Publication status: Published - Mar 2018


    • Anomaly detection
    • Cloud application operations
    • Cloud monitoring
    • Error detection
    • Log analysis
    • Metric selection
