TY - JOUR
T1 - Search-based fairness testing for regression-based machine learning systems
AU - Perera, Anjana
AU - Aleti, Aldeida
AU - Tantithamthavorn, Chakkrit
AU - Jiarpakdee, Jirayus
AU - Turhan, Burak
AU - Kuhn, Lisa
AU - Walker, Katie
N1 - Funding Information:
Contributors : Mrs Anne Loupis, Cabrini Institute, Melbourne; A/Prof Keith Joe, Monash Art, Design and Architecture, Monash Uni.; A/Prof Michael Ben-Meir, Austin and Cabrini Hospitals, Melbourne; Dr Hamed Akhlaghi, St Vincent’s and Werribee Hospitals, Melbourne; Dr Jennie Hutton, St Vincent’s Hospital, Melbourne; Dr Gabriel Blecher, Monash Medical Centre, Monash Health, Melbourne; Dr Paul Buntine, Box Hill Hospital, Eastern Health, Melbourne; Mrs Amy Sweeny, Gold Coast University Hospital, Bond University. Collaborative Group Author : Australasian College for Emergency Medicine, Clinical Trial Network (ACEM CTN) Contributor Statement: Funding: KW, MBM, KJ, BT; Ethics: AL, KW; Clinical site investigators: KW, GB, PB, JH, HA, AS; Data acquisition: AL; Data cleaning: JJ; Project concept: AP, AA, BT, CT, JJ, KW, LK; Methods: AA, AP, BT, CT, JJ; Implementation: AP; Data analysis: AA, AP, JJ; Manuscript Draft: AP, AA, CT, BT, JJ, LK, KW; Manuscript revisions: All authors; AA takes responsibility for the overall manuscript.
Funding Information:
Open Access funding provided by University of Oulu including Oulu University Hospital. The Australian government, Medical Research Future Fund, via Monash Partners, funded this study. Researchers contributed in-kind donations of time. The Cabrini Institute and Monash University provided research infrastructure support. Chakkrit Tantithamthavorn was partially supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100941).
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/5
Y1 - 2022/5
N2 - Context: Machine learning (ML) software systems are permeating many aspects of our lives, such as healthcare, transportation, banking, and recruitment. These systems are trained with data that is often biased, resulting in biased behaviour. To address this issue, fairness testing approaches have been proposed to test ML systems for fairness, which predominantly focus on assessing classification-based ML systems. These methods are not applicable to regression-based systems; for example, they do not quantify the magnitude of the disparity in predicted outcomes, which we identify as important in the context of regression-based ML systems. Method: We conduct this study as design science research. We identify the problem instance in the context of emergency department (ED) wait-time prediction. In this paper, we develop an effective and efficient fairness testing approach to evaluate the fairness of regression-based ML systems. We propose fairness degree, which is a new fairness measure for regression-based ML systems, and a novel search-based fairness testing (SBFT) approach for testing regression-based machine learning systems. We apply the proposed solutions to ED wait-time prediction software. Results: We experimentally evaluate the effectiveness and efficiency of the proposed approach with ML systems trained on real observational data from the healthcare domain. We demonstrate that SBFT significantly outperforms existing fairness testing approaches, with up to 111% and 190% increase in effectiveness and efficiency of SBFT compared to the best performing existing approaches. Conclusion: These findings indicate that our novel fairness measure and the new approach for fairness testing of regression-based ML systems can identify the degree of fairness in predictions, which can help software teams to make data-informed decisions about whether such software systems are ready to deploy.
The scientific knowledge gained from our work can be phrased as a technological rule: to measure the fairness of regression-based ML systems in the context of emergency department wait-time prediction, use fairness degree and search-based techniques to approximate it.
KW - Bias
KW - Fairness testing
KW - Machine learning
KW - Search-based software testing
KW - Software fairness
KW - Software testing
UR - http://www.scopus.com/inward/record.url?scp=85127566250&partnerID=8YFLogxK
U2 - 10.1007/s10664-022-10116-7
DO - 10.1007/s10664-022-10116-7
M3 - Article
AN - SCOPUS:85127566250
VL - 27
JO - Empirical Software Engineering
JF - Empirical Software Engineering
SN - 1382-3256
IS - 3
M1 - 79
ER -