Dealing with Missing Data

Dealing with Missing Data Challenges and Solutions Nicole Erler Department of Biostatistics, Erasmus Medical Center nerler@erasmusmcnl N_Erler wwwnerlercom NErler 13 January 2020

Handling Missing Values is Easy! Functions automatically exclude missing values: ## [] ## Residual standard error: 2305 on 69 degrees of freedom ## (25 observations deleted due to missingness) ## Multiple R-squared: 009255, Adjusted R-squared: 002679 ## F-statistic: 1407 on 5 and 69 DF, p-value: 02325 1

Handling Missing Values is Easy! Functions automatically exclude missing values: ## [] ## Residual standard error: 2305 on 69 degrees of freedom ## (25 observations deleted due to missingness) ## Multiple R-squared: 009255, Adjusted R-squared: 002679 ## F-statistic: 1407 on 5 and 69 DF, p-value: 02325 Imputation is super easy: library("mice") imp <- mice(mydata) However 1

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality violation bias 2

Imputation??? Remind me, how did that imputation thing work again??? 3

Imputation Imputation filling in missing values with (good) "guesses" 4

Imputation Imputation filling in missing values with (good) "guesses" Important: Missing values uncertainty This needs to be taken into account!!! 4

Imputation Imputation filling in missing values with (good) "guesses" Important: Missing values uncertainty This needs to be taken into account!!! Donald Rubin (in the 1970s): Represent each missing value with multiple imputed values Multiple Imputation Note: Imputation is not the only approach to handle missing values (Also: maximum likelihood, inverse probability weighting, ) 4

Multiple Imputation incomplete data multiple imputed datasets analysis results pooled results 1 Imputation: impute multiple times multiple completed datasets 2 Analysis: analyse each of the datasets 3 Pooling: combine results, taking into account additional uncertainty 5

Imputation Step Two main approaches Joint Model Multiple Imputation the "original" approach often using a multivariate normal distribution 6

Imputation Step Two main approaches Joint Model Multiple Imputation the "original" approach often using a multivariate normal distribution Multiple Imputation with Chained Equations (MICE) also: Fully Conditional Specification (FCS) now often considered the gold standard 6

Multiple Imputation with Chained Equations (MICE) For each incomplete variable, specify a model using all other variables: }{{} full conditionals x 1 x 2 x 3 x 4 NA NA NA NA NA NA 7

Multiple Imputation with Chained Equations (MICE) For each incomplete variable, specify a model using all other variables: }{{} full conditionals x 1 x 2 x 3 x 4 NA NA NA NA NA NA x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + 7

Multiple Imputation with Chained Equations (MICE) For each incomplete variable, specify a model using all other variables: }{{} full conditionals For example: x 1 x 2 x 3 x 4 NA NA NA NA NA NA x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + linear regression logistic regression 7

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, update x 1 again, based on updated x 2, x 3, x 4, x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, update x 1 again, based on updated x 2, x 3, x 4, until convergence x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, update x 1 again, based on updated x 2, x 3, x 4, until convergence x 1 x 2 x 3 x 4 NA NA NA NA NA NA Values from last iteration one imputed dataset 8

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality 9

Missing Data Mechanisms Missing Completely At Random (MCAR) Missing At Random (MAR) Missing Not At Random (MNAR) 10

Missing Data Mechanisms Missing Completely At Random (MCAR) p(r X obs, X mis ) = p(r) Missingness is independent of all data questionnaire got lost in mail Missing At Random (MAR) Missing Not At Random (MNAR) 10

Missing Data Mechanisms Missing Completely At Random (MCAR) p(r X obs, X mis ) = p(r) Missingness is independent of all data Missing At Random (MAR) p(r X obs, X mis ) = p(r X obs ) Missingness depends only on observed data questionnaire got lost in mail overweight participants are less likely to report their chocolate consumption (and we know their weight) Missing Not At Random (MNAR) 10

Missing Data Mechanisms Missing Completely At Random (MCAR) p(r X obs, X mis ) = p(r) Missingness is independent of all data Missing At Random (MAR) p(r X obs, X mis ) = p(r X obs ) Missingness depends only on observed data questionnaire got lost in mail overweight participants are less likely to report their chocolate consumption (and we know their weight) Missing Not At Random (MNAR) p(r X obs, X mis ) p(r X obs ) Missingness depends (also) on unobserved data overweight participants are less likely to report their weight 10

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality In case of MNAR: MICE bias 11

Imputation Model Misspecification x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + For example: linear regression logistic regression 13

Imputation Model Misspecification x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + For example: linear regression logistic regression count count incomplete covariate incomplete covariate incomplete covariate covariate 13

Imputation Model Misspecification count count incomplete covariate incomplete covariate incomplete covariate covariate misspecification of the residual distribution misspecification of the association structure 14

Imputation Model Misspecification count count incomplete covariate incomplete covariate incomplete covariate covariate misspecification of the residual distribution misspecification of the association structure Partial solutions: Predictive Mean Matching Passive imputation 14

Imputation Model Misspecification count count incomplete covariate incomplete covariate incomplete covariate covariate misspecification of the residual distribution misspecification of the association structure Partial solutions: Predictive Mean Matching Passive imputation But can get tedious requires knowledge (about data & methods) users often inexperienced and/or lazy 14

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality Model misspecification bias 15

Compatibility & Congeniality Compatibility A joint distribution exists, that has the full conditionals (imputation models) as its conditional distributions 17

Compatibility & Congeniality Compatibility A joint distribution exists, that has the full conditionals (imputation models) as its conditional distributions Congeniality The imputation model is compatible with the analysis model 17

Compatibility & Congeniality in MICE MICE is based on the idea of Gibbs sampling Exploits the fact that a joint distribution is fully determined by its full conditional distributions 18

Compatibility & Congeniality in MICE MICE is based on the idea of Gibbs sampling Exploits the fact that a joint distribution is fully determined by its full conditional distributions But: In MICE Imputation models are specified directly no guarantee that a corresponding joint distribution exists 18

Compatibility & Congeniality in MICE joint distribution Gibbs MICE full conditionals 19

Compatibility & Congeniality in MICE joint distribution Gibbs MICE full conditionals Is this a problem? often not but it can be when imputation/analysis models contradict each other different assumptions are made during analysis and imputation the outcome cannot easily be included in the imputation models 19

Example 1: Contradicting Models Analysis model with a quadratic association: y = β 0 + β 1 x + β 2 x 2 + x y observed missing 20

Example 1: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y +, ie, a linear relation between x and y is assumed x y observed missing imputed 21

Example 1: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y +, ie, a linear relation between x and y is assumed x y fit on complete fit on imputed observed missing imputed 21

Example 2: Contradicting Models Analysis model with interaction term: y = β 0 + β x x + β z z + β xz xz +, ie, y again has a non-linear relationship with x x y missing (z = 0) missing (z = 1) observed (z = 0) observed (z = 1) 22

Example 2: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y + θ 12 z +, x y missing (z = 0) missing (z = 1) observed (z = 0) observed (z = 1) imputed (z = 0) imputed (z = 1) 23

Example 2: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y + θ 12 z +, y missing (z = 0) missing (z = 1) observed (z = 0) observed (z = 1) imputed (z = 0) imputed (z = 1) true imputed x 23

Example 3: Longitudinal / Multi-level Data id time y x 1 034 012 1 065-004 1 068 030 1 197 044 1 238 048 1 309 046 2 211 043 NA 2 372 046 NA 2 382 046 NA 2 413 029 NA 24

Example 3: Longitudinal / Multi-level Data Imputation in long format rows are treated as independent imputations in baseline covariates will vary over time bias Can we use data in wide format (one row per subject)? can be very inefficient not always possible id time y x 1 034 012 1 065-004 1 068 030 1 197 044 1 238 048 1 309 046 2 211 043 NA 2 372 046 NA 2 382 046 NA 2 413 029 NA 25

Example 3: Longitudinal / Multi-level Data y y time time y y time time 26

Compatibility & Congeniality in MICE Lack of compatibility / congeniality can become a problem for MICE in settings with Non-linear associations non-linear effects interaction terms complex outcomes multi-level settings time-to-event outcomes What can we do in these settings? 27

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals 28

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals Solution: Start with the joint distribution! 28

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals Solution: Start with the joint distribution! New problem: What is the multivariate distribution of multiple variables of different types? 28

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals Solution: Start with the joint distribution! New problem: What is the multivariate distribution of multiple variables of different types? Usually, the joint distribution is not of any known form 28

Joint Model Imputation Multivariate Normal Model Approximate the joint distribution by a known multivariate (usually normal) distribution this is Joint Model Multiple Imputation assures compatibility & congeniality can t handle non-linear associations 29

Joint Model Imputation Multivariate Normal Model Approximate the joint distribution by a known multivariate (usually normal) distribution this is Joint Model Multiple Imputation assures compatibility & congeniality can t handle non-linear associations Sequential Factorization Factorize the joint distribution into (a sequence of) conditional distributions assures compatibility & congeniality can handle non-linear associations 29

Sequential Factorization A joint distribution p(y, x) can be written as the product of conditional distributions: p(y, x) = p(y x) p(x) (or alternatively p(y, x) = p(x y) p(y)) 30

Sequential Factorization A joint distribution p(y, x) can be written as the product of conditional distributions: p(y, x) = p(y x) p(x) (or alternatively p(y, x) = p(x y) p(y)) This can be extended for more variables: p(y, x 1,, x p ) = p(y x 1,, x p ) p(x 1 x 2,, x p ) p(x 2 x 3,, x p ) p(x p ) 30

Sequential Factorization in the Bayesian Framework Joint Distribution p(y, X, θ) = p(y X, θ) }{{} analysis model p(x θ) }{{} imputation part θ contains regr coefficients, variance parameters, p(θ) }{{} priors 31

Sequential Factorization in the Bayesian Framework Joint Distribution p(y, X, θ) = p(y X, θ) }{{} analysis model p(x θ) }{{} imputation part θ contains regr coefficients, variance parameters, Imputation part p(θ) }{{} priors X {}}{ p( x 1,, x p, X compl θ) = p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) 31

Sequential Factorization in the Bayesian Framework Extension for a Multi-level Setting p(y X, b, θ) p(x θ) }{{}}{{} analysis imputation model part p(b θ) }{{} random effects p(θ) }{{} priors 32

Sequential Factorization in the Bayesian Framework Extension for a Multi-level Setting p(y X, b, θ) p(x θ) }{{}}{{} analysis imputation model part p(b θ) }{{} random effects p(θ) }{{} priors Extension for a Time-to-Event Outcome p(t, D X, θ) }{{} analysis model p(x θ) }{{} imputation part p(θ) }{{} priors 32

Sequential Factorization in the Bayesian Framework Extension for a Multi-level Setting p(y X, b, θ) p(x θ) }{{}}{{} analysis imputation model part p(b θ) }{{} random effects p(θ) }{{} priors Extension for a Time-to-Event Outcome p(t, D X, θ) }{{} analysis model Extension for a Multivariate Outcome p(y 1, y 2 X, θ) } {{ } analysis model p(x θ) }{{} imputation part p(x θ) }{{} imputation part p(θ) }{{} priors p(θ) }{{} priors 32

MICE vs Sequential Factorization Imputation in MICE p(x 1 y, X compl, x 2, x 3, x 4,, θ) p(x 2 y, X compl, x 1, x 3, x 4,, θ) p(x 3 y, X compl, x 1, x 2, x 4,, θ) Sequential Factorization p(y X compl, x 1, x 2, x 3,, θ) p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) 33

MICE vs Sequential Factorization Imputation in MICE p(x 1 y, X compl, x 2, x 3, x 4,, θ) p(x 2 y, X compl, x 1, x 3, x 4,, θ) p(x 3 y, X compl, x 1, x 2, x 4,, θ) Sequential Factorization p(y X compl, x 1, x 2, x 3,, θ) p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) No issues with complex outcomes, eg: multi-level survival non-linear effects congeniality compatibility 33

MICE vs Sequential Factorization Imputation in MICE p(x 1 y, X compl, x 2, x 3, x 4,, θ) p(x 2 y, X compl, x 1, x 3, x 4,, θ) p(x 3 y, X compl, x 1, x 2, x 4,, θ) Sequential Factorization p(y X compl, x 1, x 2, x 3,, θ) p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) Analysis model part of specification parameters of interest directly available no need for pooling simultaneous analysis and imputation 34

Joint Analysis and Imputation in Sequential Factorization is implemented in the package JointAI 35

Joint Analysis and Imputation in Sequential Factorization is implemented in the package JointAI Bayesian analysis of incomplete data using (generalized) linear regression (generalized) linear mixed models ordinal (mixed) models parametric (Weibull) time-to-event models Cox proportional hazards models 35

Joint Analysis and Imputation in Sequential Factorization is implemented in the package JointAI Bayesian analysis of incomplete data using (generalized) linear regression (generalized) linear mixed models ordinal (mixed) models parametric (Weibull) time-to-event models Cox proportional hazards models on CRAN: https://cranr-projectorg/package=jointai GitHub: https://githubcom/nerler/jointai website: https://nerlergithubio/jointai/ 35

Joint Analysis and Imputation in standard regression mixed model type outcome covariate outcome covariate normal lognormal (soon) (soon) Gamma beta (soon) (soon) (soon) binomial poisson (soon) ordinal multinomial (soon) (soon) (soon) Available soon: Joint models (of longitudinal & time-to-event data) Multivariate models 36

JointAI: How does it work? Requirements: (https://cranr-projectorg/) JAGS (Just Another Gibbs Sampler; https://sourceforgenet/projects/mcmc-jags/files/jags/4x/) 37

JointAI: How does it work? Requirements: (https://cranr-projectorg/) JAGS (Just Another Gibbs Sampler; https://sourceforgenet/projects/mcmc-jags/files/jags/4x/) Installation: installpackages("jointai") 37

JointAI: How does it work? Requirements: (https://cranr-projectorg/) JAGS (Just Another Gibbs Sampler; https://sourceforgenet/projects/mcmc-jags/files/jags/4x/) Installation: installpackages("jointai") Usage: library("jointai") res <- lm_imp(sbp ~ age + gender + smoke + occup, data = NHANES, niter = 300) 37

JointAI: How does it work? traceplot(res) 38

JointAI: How does it work? summary(res) ## ## Linear model fitted with JointAI ## ## Call: ## lm_imp(formula = SBP ~ age + gender + smoke + occup, data = NHANES, ## niter = 300) ## ## Posterior summary: ## Mean SD 25% 975% tail-prob GR-crit ## (Intercept) 106222 33979 99461 112961 00000 100 ## age 0427 00798 0278 0583 00000 100 ## genderfemale -7450 22718-11755 -3072 00000 100 ## smokeformer -6692 30297-12342 -0885 00267 103 ## smokecurrent -2658 30229-8450 3313 03711 101 ## occuplooking for work 3817 64037-9487 16087 05044 101 ## occupnot working -0869 26858-6110 4256 07511 102 ## ## Posterior summary of residual std deviation: ## Mean SD 25% 975% GR-crit ## sigma_sbp 143 0753 128 158 0999 ## ## ## MCMC settings ## [] 39

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality 40

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality extension to MNAR using pattern mixture model 40

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality extension to MNAR using pattern mixture model non-parametric Bayesian methods 40

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality extension to MNAR using pattern mixture model non-parametric Bayesian methods semi-parametric methods 40

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias 41

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias good use of (imputation) methods requires knowledge of the data knowledge of the methods knowledge of the software time & patience! 41

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias good use of (imputation) methods requires knowledge of the data knowledge of the methods knowledge of the software time & patience! JointAI aims to facilitate correct handling of missing values by assuring compatibility & congeniality simultaneous analysis & imputation especially for complex settings 41

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias good use of (imputation) methods requires knowledge of the data knowledge of the methods knowledge of the software time & patience! JointAI aims to facilitate correct handling of missing values by assuring compatibility & congeniality simultaneous analysis & imputation especially for complex settings There is no magical solution that will always work in all settings 41

Thank you for your attention nerler@erasmusmcnl N_Erler NErler wwwnerlercom