A comparison of model selection methods for prediction in the presence of multiply imputed data

Le Thi Phuong Thao, Ronald Geskus

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Many approaches to variable selection with multiply imputed (MI) data have been proposed for the development of prognostic models, but no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on the AIC or the lasso, with the final model fitted on the most frequently (e.g. in ≥ 50% of all MI and bootstrap data sets) selected variables; (II) model selection on the original MI data, using the lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in at least 50% of the MI data sets; (iii) performing the lasso on the stacked MI data; or (iv) as in (iii), but with individual weights determined by each subject's fraction of missing values. In all lasso models, we used both the optimal penalty and the one-standard-error (1-se) rule. We also considered recalibrating the models to correct for the overshrinkage caused by a suboptimal penalty, by refitting either the linear predictor or all individual variables. We applied the methods to a real data set of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty showed the best performance in both approaches I and II. Stacking the MI data is attractive because it avoids choosing a selection threshold when combining results from separate MI data sets.
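The stacked-and-weighted variant (II-iv) can be sketched in code: the M imputed data sets are stacked into one long data set, each subject's rows receive weight (1 − f_i)/M, where f_i is that subject's fraction of missing values, and an L1-penalised logistic regression is fitted with those observation weights. The sketch below is illustrative only, not the authors' implementation; the simulated data, the noise used to mimic imputation variability, and the proximal-gradient solver are all assumptions made to keep the example self-contained.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_logistic(X, y, w, lam, n_iter=2000):
    """L1-penalised logistic regression with per-observation weights,
    fitted by proximal gradient descent (ISTA). Intercept unpenalised."""
    n, p = X.shape
    W = w.sum()
    # Step size from an upper bound on the Lipschitz constant of the
    # weighted mean log-loss gradient (logistic Hessian is bounded by 1/4).
    L = 0.25 * np.linalg.norm(X * np.sqrt(w)[:, None], 2) ** 2 / W
    lr = 1.0 / L
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))  # fitted probabilities
        g = w * (mu - y)                                 # weighted score residuals
        beta0 -= lr * g.sum() / W
        beta = soft_threshold(beta - lr * (X.T @ g) / W, lr * lam)
    return beta0, beta

rng = np.random.default_rng(1)
n, p, M = 200, 6, 5                     # subjects, candidate variables, imputations
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
X_obs = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X_obs @ beta_true))))
f = rng.uniform(0.0, 0.4, size=n)       # fraction of missing values per subject

# Stack M imputed copies; imputation variability is mimicked here by noise.
X_stack = np.vstack([X_obs + 0.1 * rng.normal(size=(n, p)) for _ in range(M)])
y_stack = np.tile(y, M)
w_stack = np.tile((1.0 - f) / M, M)     # down-weight subjects with more missingness

b0, b = weighted_lasso_logistic(X_stack, y_stack, w_stack, lam=0.1)
selected = np.flatnonzero(np.abs(b) > 1e-8)   # indices of retained variables
```

Because every subject contributes M correlated rows, the stacked data must not be treated as M·n independent observations; the weights summing to at most one per subject keep the effective sample size honest, which is the motivation for the (1 − f_i)/M choice sketched above.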

Original language: English
Pages (from-to): 343-356
Number of pages: 14
Journal: Biometrical Journal
Volume: 61
Issue number: 2
DOI: 10.1002/bimj.201700232
Publication status: Published - Mar 2019
Externally published: Yes

Keywords

  • lasso
  • multiply imputed data
  • prediction
  • stacked data
  • variable selection

Cite this

@article{097a0f3643bb4d9095cfd8c9278ce868,
title = "A comparison of model selection methods for prediction in the presence of multiply imputed data",
abstract = "Many approaches for variable selection with multiply imputed data in the development of a prognostic model have been proposed. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) Model selection on bootstrap data, using backward elimination based on AIC or lasso, and fit the final model based on the most frequently (e.g. ≥ 50{\%}) selected variables over all MI and bootstrap data sets; (II) Model selection on original MI data, using lasso. The final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in 50{\%} of the MI data; (iii) performing lasso on the stacked MI data, and (iv) as in (iii) but using individual weights as determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods on a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, applying lasso selection with the 1-se penalty shows the best performance, both in approach I and II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.",
keywords = "lasso, multiply imputed data, prediction, stacked data, variable selection",
author = "Thao, {Le Thi Phuong} and Ronald Geskus",
year = "2019",
month = "3",
doi = "10.1002/bimj.201700232",
language = "English",
volume = "61",
pages = "343--356",
journal = "Biometrical Journal",
issn = "0323-3847",
publisher = "Wiley-VCH Verlag GmbH & Co. KGaA",
number = "2",

}

Scopus: http://www.scopus.com/inward/record.url?scp=85055531742&partnerID=8YFLogxK