Dealing with Missing Data

Relaterede dokumenter
Vina Nguyen HSSP July 13, 2008

X M Y. What is mediation? Mediation analysis an introduction. Definition

Basic statistics for experimental medical researchers

DoodleBUGS (Hands-on)

Generalized Probit Model in Design of Dose Finding Experiments. Yuehui Wu Valerii V. Fedorov RSU, GlaxoSmithKline, US

Linear Programming ١ C H A P T E R 2

Project Step 7. Behavioral modeling of a dual ported register set. 1/8/ L11 Project Step 5 Copyright Joanne DeGroat, ECE, OSU 1

Skriftlig Eksamen Kombinatorik, Sandsynlighed og Randomiserede Algoritmer (DM528)

Black Jack --- Review. Spring 2012

IBM Network Station Manager. esuite 1.5 / NSM Integration. IBM Network Computer Division. tdc - 02/08/99 lotusnsm.prz Page 1

Privat-, statslig- eller regional institution m.v. Andet Added Bekaempelsesudfoerende: string No Label: Bekæmpelsesudførende

Den nye Eurocode EC Geotenikerdagen Morten S. Rasmussen

PARALLELIZATION OF ATTILA SIMULATOR WITH OPENMP MIGUEL ÁNGEL MARTÍNEZ DEL AMOR MINIPROJECT OF TDT24 NTNU

A multimodel data assimilation framework for hydrology

Statistik for MPH: 7

Statistik for MPH: oktober Attributable risk, bestemmelse af stikprøvestørrelse (Silva: , )

The X Factor. Målgruppe. Læringsmål. Introduktion til læreren klasse & ungdomsuddannelser Engelskundervisningen

Portal Registration. Check Junk Mail for activation . 1 Click the hyperlink to take you back to the portal to confirm your registration

Engelsk. Niveau D. De Merkantile Erhvervsuddannelser September Casebaseret eksamen. og

Subject to terms and conditions. WEEK Type Price EUR WEEK Type Price EUR WEEK Type Price EUR WEEK Type Price EUR

Hvor er mine runde hjørner?

Resource types R 1 1, R 2 2,..., R m CPU cycles, memory space, files, I/O devices Each resource type R i has W i instances.

Skriftlig Eksamen Diskret matematik med anvendelser (DM72)

Kvant Eksamen December timer med hjælpemidler. 1 Hvad er en continuous variable? Giv 2 illustrationer.

Shooting tethered med Canon EOS-D i Capture One Pro. Shooting tethered i Capture One Pro 6.4 & 7.0 på MAC OS-X & 10.8

SOFTWARE PROCESSES. Dorte, Ida, Janne, Nikolaj, Alexander og Erla

LESSON NOTES Extensive Reading in Danish for Intermediate Learners #8 How to Interview

Agenda. The need to embrace our complex health care system and learning to do so. Christian von Plessen Contributors to healthcare services in Denmark

Engelsk. Niveau C. De Merkantile Erhvervsuddannelser September Casebaseret eksamen. og

Department of Public Health. Case-control design. Katrine Strandberg-Larsen Department of Public Health, Section of Social Medicine

Particle-based T-Spline Level Set Evolution for 3D Object Reconstruction with Range and Volume Constraints

How Long Is an Hour? Family Note HOME LINK 8 2

Module I: Statistical Background on Multi-level Models

KA 4.2 Kvantitative Forskningsmetoder Forår 2010

User Manual for LTC IGNOU

Forslag til implementering af ResearcherID og ORCID på SCIENCE

Molio specifications, development and challenges. ICIS DA 2019 Portland, Kim Streuli, Molio,

Brug sømbrættet til at lave sjove figurer. Lav fx: Få de andre til at gætte, hvad du har lavet. Use the nail board to make funny shapes.

DK - Quick Text Translation. HEYYER Net Promoter System Magento extension

Introduction Ronny Bismark

Trolling Master Bornholm 2013

On the complexity of drawing trees nicely: corrigendum

what is this all about? Introduction three-phase diode bridge rectifier input voltages input voltages, waveforms normalization of voltages voltages?

Userguide. NN Markedsdata. for. Microsoft Dynamics CRM v. 1.0

Besvarelser til Lineær Algebra Reeksamen Februar 2017

Overlevelse efter AMI. Hvilken betydning har følgende faktorer for risikoen for ikke at overleve: Køn og alder betragtes som confoundere.

Remember the Ship, Additional Work

Bilag. Resume. Side 1 af 12

ECE 551: Digital System * Design & Synthesis Lecture Set 5

Financial Literacy among 5-7 years old children

GUIDE TIL BREVSKRIVNING

Business Opening. Very formal, recipient has a special title that must be used in place of their name

Business Opening. Very formal, recipient has a special title that must be used in place of their name

RoE timestamp and presentation time in past

Ikke-parametriske tests

E-PAD Bluetooth hængelås E-PAD Bluetooth padlock E-PAD Bluetooth Vorhängeschloss

Unitel EDI MT940 June Based on: SWIFT Standards - Category 9 MT940 Customer Statement Message (January 2004)

Trolling Master Bornholm 2016 Nyhedsbrev nr. 6

CHAPTER 8: USING OBJECTS

Vores mange brugere på musskema.dk er rigtig gode til at komme med kvalificerede ønsker og behov.

Must I be a registered company in Denmark? That is not required. Both Danish and foreign companies can trade at Gaspoint Nordic.

Trolling Master Bornholm 2016 Nyhedsbrev nr. 5

Aktivering af Survey funktionalitet

Mandara. PebbleCreek. Tradition Series. 1,884 sq. ft robson.com. Exterior Design A. Exterior Design B.

IPv6 Application Trial Services. 2003/08/07 Tomohide Nagashima Japan Telecom Co., Ltd.

Observation Processes:

Nyhedsmail, december 2013 (scroll down for English version)

Managing stakeholders on major projects. - Learnings from Odense Letbane. Benthe Vestergård Communication director Odense Letbane P/S

ATEX direktivet. Vedligeholdelse af ATEX certifikater mv. Steen Christensen

Erik Parner Sektion for Biostatistik. Biostatistisk metode et par eksempler

Reexam questions in Statistics and Evidence-based medicine, august sem. Medis/Medicin, Modul 2.4.

DENCON ARBEJDSBORDE DENCON DESKS

Mandara. PebbleCreek. Tradition Series. 1,884 sq. ft robson.com. Exterior Design A. Exterior Design B.

DANSK INSTALLATIONSVEJLEDNING VLMT500 ADVARSEL!

Velkommen til IFF QA erfa møde d. 15. marts Erfaringer med miljømonitorering og tolkning af nyt anneks 1.

The GAssist Pittsburgh Learning Classifier System. Dr. J. Bacardit, N. Krasnogor G53BIO - Bioinformatics

United Nations Secretariat Procurement Division

Cross-Sectorial Collaboration between the Primary Sector, the Secondary Sector and the Research Communities

Implementing SNOMED CT in a Danish region. Making sharable and comparable nursing documentation

Differential Evolution (DE) "Biologically-inspired computing", T. Krink, EVALife Group, Univ. of Aarhus, Denmark

Advanced Statistical Computing Week 5: EM Algorithm

Den uddannede har viden om: Den uddannede kan:

Engineering of Chemical Register Machines

Sikkerhedsvejledning

CS 4390/5387 SOFTWARE V&V LECTURE 5 BLACK-BOX TESTING - 2

Som mentalt og moralsk problem

LED STAR PIN G4 BASIC INFORMATION: Series circuit. Parallel circuit HOW CAN I UNDERSTAND THE FOLLOWING SHEETS?

Sign variation, the Grassmannian, and total positivity

Titel: Barry s Bespoke Bakery

Our activities. Dry sales market. The assortment

Applications. Computational Linguistics: Jordan Boyd-Graber University of Maryland RL FOR MACHINE TRANSLATION. Slides adapted from Phillip Koehn

Generelle lineære modeller

Status på det trådløse netværk

Forventer du at afslutte uddannelsen/har du afsluttet/ denne sommer?

The River Underground, Additional Work

How consumers attributions of firm motives for engaging in CSR affects their willingness to pay

Barnets navn: Børnehave: Kommune: Barnets modersmål (kan være mere end et)

The effects of occupant behaviour on energy consumption in buildings

Logistisk Regression - fortsat

Richter 2013 Presentation Mentor: Professor Evans Philosophy Department Taylor Henderson May 31, 2013

Transkript:

Dealing with Missing Data Challenges and Solutions Nicole Erler Department of Biostatistics, Erasmus Medical Center nerler@erasmusmcnl N_Erler wwwnerlercom NErler 13 January 2020

Handling Missing Values is Easy! Functions automatically exclude missing values: ## [] ## Residual standard error: 2305 on 69 degrees of freedom ## (25 observations deleted due to missingness) ## Multiple R-squared: 009255, Adjusted R-squared: 002679 ## F-statistic: 1407 on 5 and 69 DF, p-value: 02325 1

Handling Missing Values is Easy! Functions automatically exclude missing values: ## [] ## Residual standard error: 2305 on 69 degrees of freedom ## (25 observations deleted due to missingness) ## Multiple R-squared: 009255, Adjusted R-squared: 002679 ## F-statistic: 1407 on 5 and 69 DF, p-value: 02325 Imputation is super easy: library("mice") imp <- mice(mydata) However 1

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality 2

Handling Missing Values Correctly is Not So Easy! Complete case analysis is usually biased (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality violation bias 2

Imputation??? Remind me, how did that imputation thing work again??? 3

Imputation Imputation filling in missing values with (good) "guesses" 4

Imputation Imputation filling in missing values with (good) "guesses" Important: Missing values uncertainty This needs to be taken into account!!! 4

Imputation Imputation filling in missing values with (good) "guesses" Important: Missing values uncertainty This needs to be taken into account!!! Donald Rubin (in the 1970s): Represent each missing value with multiple imputed values Multiple Imputation Note: Imputation is not the only approach to handle missing values (Also: maximum likelihood, inverse probability weighting, ) 4

Multiple Imputation incomplete data multiple imputed datasets analysis results pooled results 1 Imputation: impute multiple times multiple completed datasets 2 Analysis: analyse each of the datasets 3 Pooling: combine results, taking into account additional uncertainty 5

Imputation Step Two main approaches Joint Model Multiple Imputation the "original" approach often using a multivariate normal distribution 6

Imputation Step Two main approaches Joint Model Multiple Imputation the "original" approach often using a multivariate normal distribution Multiple Imputation with Chained Equations (MICE) also: Fully Conditional Specification (FCS) now often considered the gold standard 6

Multiple Imputation with Chained Equations (MICE) For each incomplete variable, specify a model using all other variables: }{{} full conditionals x 1 x 2 x 3 x 4 NA NA NA NA NA NA 7

Multiple Imputation with Chained Equations (MICE) For each incomplete variable, specify a model using all other variables: }{{} full conditionals x 1 x 2 x 3 x 4 NA NA NA NA NA NA x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + 7

Multiple Imputation with Chained Equations (MICE) For each incomplete variable, specify a model using all other variables: }{{} full conditionals For example: x 1 x 2 x 3 x 4 NA NA NA NA NA NA x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + linear regression logistic regression 7

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, update x 1 again, based on updated x 2, x 3, x 4, x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, update x 1 again, based on updated x 2, x 3, x 4, until convergence x 1 x 2 x 3 x 4 NA NA NA NA NA NA 8

Multiple Imputation with Chained Equations (MICE) MICE is an iterative algorithm: start with initial guess update x 1 based on initial values of x 2, x 3, x 4, update x 2 based on new x 1 and initial values of x 3, x 4, update x 1 again, based on updated x 2, x 3, x 4, until convergence x 1 x 2 x 3 x 4 NA NA NA NA NA NA Values from last iteration one imputed dataset 8

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality 9

Missing Data Mechanisms Missing Completely At Random (MCAR) Missing At Random (MAR) Missing Not At Random (MNAR) 10

Missing Data Mechanisms Missing Completely At Random (MCAR) p(r X obs, X mis ) = p(r) Missingness is independent of all data questionnaire got lost in mail Missing At Random (MAR) Missing Not At Random (MNAR) 10

Missing Data Mechanisms Missing Completely At Random (MCAR) p(r X obs, X mis ) = p(r) Missingness is independent of all data Missing At Random (MAR) p(r X obs, X mis ) = p(r X obs ) Missingness depends only on observed data questionnaire got lost in mail overweight participants are less likely to report their chocolate consumption (and we know their weight) Missing Not At Random (MNAR) 10

Missing Data Mechanisms Missing Completely At Random (MCAR) p(r X obs, X mis ) = p(r) Missingness is independent of all data Missing At Random (MAR) p(r X obs, X mis ) = p(r X obs ) Missingness depends only on observed data questionnaire got lost in mail overweight participants are less likely to report their chocolate consumption (and we know their weight) Missing Not At Random (MNAR) p(r X obs, X mis ) p(r X obs ) Missingness depends (also) on unobserved data overweight participants are less likely to report their weight 10

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality In case of MNAR: MICE bias 11

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality 12

Imputation Model Misspecification x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + For example: linear regression logistic regression 13

Imputation Model Misspecification x 1 x 2 + x 3 + x 4 + x 2 x 1 + x 3 + x 4 + x 3 x 1 + x 2 + x 4 + x 4 x 1 + x 2 + x 3 + For example: linear regression logistic regression count count incomplete covariate incomplete covariate incomplete covariate covariate 13

Imputation Model Misspecification count count incomplete covariate incomplete covariate incomplete covariate covariate misspecification of the residual distribution misspecification of the association structure 14

Imputation Model Misspecification count count incomplete covariate incomplete covariate incomplete covariate covariate misspecification of the residual distribution misspecification of the association structure Partial solutions: Predictive Mean Matching Passive imputation 14

Imputation Model Misspecification count count incomplete covariate incomplete covariate incomplete covariate covariate misspecification of the residual distribution misspecification of the association structure Partial solutions: Predictive Mean Matching Passive imputation But can get tedious requires knowledge (about data & methods) users often inexperienced and/or lazy 14

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality Model misspecification bias 15

MICE Makes Assumptions (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution (eg normal) all associations are linear compatibility and congeniality 16

Compatibility & Congeniality Compatibility A joint distribution exists, that has the full conditionals (imputation models) as its conditional distributions 17

Compatibility & Congeniality Compatibility A joint distribution exists, that has the full conditionals (imputation models) as its conditional distributions Congeniality The imputation model is compatible with the analysis model 17

Compatibility & Congeniality in MICE MICE is based on the idea of Gibbs sampling Exploits the fact that a joint distribution is fully determined by its full conditional distributions 18

Compatibility & Congeniality in MICE MICE is based on the idea of Gibbs sampling Exploits the fact that a joint distribution is fully determined by its full conditional distributions But: In MICE Imputation models are specified directly no guarantee that a corresponding joint distribution exists 18

Compatibility & Congeniality in MICE joint distribution Gibbs MICE full conditionals 19

Compatibility & Congeniality in MICE joint distribution Gibbs MICE full conditionals Is this a problem? often not but it can be when imputation/analysis models contradict each other different assumptions are made during analysis and imputation the outcome cannot easily be included in the imputation models 19

Example 1: Contradicting Models Analysis model with a quadratic association: y = β 0 + β 1 x + β 2 x 2 + x y observed missing 20

Example 1: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y +, ie, a linear relation between x and y is assumed x y observed missing imputed 21

Example 1: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y +, ie, a linear relation between x and y is assumed x y fit on complete fit on imputed observed missing imputed 21

Example 2: Contradicting Models Analysis model with interaction term: y = β 0 + β x x + β z z + β xz xz +, ie, y again has a non-linear relationship with x x y missing (z = 0) missing (z = 1) observed (z = 0) observed (z = 1) 22

Example 2: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y + θ 12 z +, x y missing (z = 0) missing (z = 1) observed (z = 0) observed (z = 1) imputed (z = 0) imputed (z = 1) 23

Example 2: Contradicting Models Imputation model for x (when using MICE naively): x = θ 10 + θ 11 y + θ 12 z +, y missing (z = 0) missing (z = 1) observed (z = 0) observed (z = 1) imputed (z = 0) imputed (z = 1) true imputed x 23

Example 3: Longitudinal / Multi-level Data id time y x 1 034 012 1 065-004 1 068 030 1 197 044 1 238 048 1 309 046 2 211 043 NA 2 372 046 NA 2 382 046 NA 2 413 029 NA 24

Example 3: Longitudinal / Multi-level Data Imputation in long format rows are treated as independent imputations in baseline covariates will vary over time bias Can we use data in wide format (one row per subject)? can be very inefficient not always possible id time y x 1 034 012 1 065-004 1 068 030 1 197 044 1 238 048 1 309 046 2 211 043 NA 2 372 046 NA 2 382 046 NA 2 413 029 NA 25

Example 3: Longitudinal / Multi-level Data y y time time y y time time 26

Compatibility & Congeniality in MICE Lack of compatibility / congeniality can become a problem for MICE in settings with Non-linear associations non-linear effects interaction terms complex outcomes multi-level settings time-to-event outcomes What can we do in these settings? 27

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals 28

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals Solution: Start with the joint distribution! 28

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals Solution: Start with the joint distribution! New problem: What is the multivariate distribution of multiple variables of different types? 28

Imputation in Complex Settings Remember, the problem is joint distribution Gibbs MICE full conditionals Solution: Start with the joint distribution! New problem: What is the multivariate distribution of multiple variables of different types? Usually, the joint distribution is not of any known form 28

Joint Model Imputation Multivariate Normal Model Approximate the joint distribution by a known multivariate (usually normal) distribution this is Joint Model Multiple Imputation assures compatibility & congeniality can t handle non-linear associations 29

Joint Model Imputation Multivariate Normal Model Approximate the joint distribution by a known multivariate (usually normal) distribution this is Joint Model Multiple Imputation assures compatibility & congeniality can t handle non-linear associations Sequential Factorization Factorize the joint distribution into (a sequence of) conditional distributions assures compatibility & congeniality can handle non-linear associations 29

Sequential Factorization A joint distribution p(y, x) can be written as the product of conditional distributions: p(y, x) = p(y x) p(x) (or alternatively p(y, x) = p(x y) p(y)) 30

Sequential Factorization A joint distribution p(y, x) can be written as the product of conditional distributions: p(y, x) = p(y x) p(x) (or alternatively p(y, x) = p(x y) p(y)) This can be extended for more variables: p(y, x 1,, x p ) = p(y x 1,, x p ) p(x 1 x 2,, x p ) p(x 2 x 3,, x p ) p(x p ) 30

Sequential Factorization in the Bayesian Framework Joint Distribution p(y, X, θ) = p(y X, θ) }{{} analysis model p(x θ) }{{} imputation part θ contains regr coefficients, variance parameters, p(θ) }{{} priors 31

Sequential Factorization in the Bayesian Framework Joint Distribution p(y, X, θ) = p(y X, θ) }{{} analysis model p(x θ) }{{} imputation part θ contains regr coefficients, variance parameters, Imputation part p(θ) }{{} priors X {}}{ p( x 1,, x p, X compl θ) = p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) 31

Sequential Factorization in the Bayesian Framework Extension for a Multi-level Setting p(y X, b, θ) p(x θ) }{{}}{{} analysis imputation model part p(b θ) }{{} random effects p(θ) }{{} priors 32

Sequential Factorization in the Bayesian Framework Extension for a Multi-level Setting p(y X, b, θ) p(x θ) }{{}}{{} analysis imputation model part p(b θ) }{{} random effects p(θ) }{{} priors Extension for a Time-to-Event Outcome p(t, D X, θ) }{{} analysis model p(x θ) }{{} imputation part p(θ) }{{} priors 32

Sequential Factorization in the Bayesian Framework Extension for a Multi-level Setting p(y X, b, θ) p(x θ) }{{}}{{} analysis imputation model part p(b θ) }{{} random effects p(θ) }{{} priors Extension for a Time-to-Event Outcome p(t, D X, θ) }{{} analysis model Extension for a Multivariate Outcome p(y 1, y 2 X, θ) } {{ } analysis model p(x θ) }{{} imputation part p(x θ) }{{} imputation part p(θ) }{{} priors p(θ) }{{} priors 32

MICE vs Sequential Factorization Imputation in MICE p(x 1 y, X compl, x 2, x 3, x 4,, θ) p(x 2 y, X compl, x 1, x 3, x 4,, θ) p(x 3 y, X compl, x 1, x 2, x 4,, θ) Sequential Factorization p(y X compl, x 1, x 2, x 3,, θ) p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) 33

MICE vs Sequential Factorization Imputation in MICE p(x 1 y, X compl, x 2, x 3, x 4,, θ) p(x 2 y, X compl, x 1, x 3, x 4,, θ) p(x 3 y, X compl, x 1, x 2, x 4,, θ) Sequential Factorization p(y X compl, x 1, x 2, x 3,, θ) p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) No issues with complex outcomes, eg: multi-level survival non-linear effects congeniality compatibility 33

MICE vs Sequential Factorization Imputation in MICE p(x 1 y, X compl, x 2, x 3, x 4,, θ) p(x 2 y, X compl, x 1, x 3, x 4,, θ) p(x 3 y, X compl, x 1, x 2, x 4,, θ) Sequential Factorization p(y X compl, x 1, x 2, x 3,, θ) p(x 1 X compl, θ) p(x 2 X compl, x 1, θ) p(x 3 X compl, x 1, x 2, θ) Analysis model part of specification parameters of interest directly available no need for pooling simultaneous analysis and imputation 34

Joint Analysis and Imputation in Sequential Factorization is implemented in the package JointAI 35

Joint Analysis and Imputation in Sequential Factorization is implemented in the package JointAI Bayesian analysis of incomplete data using (generalized) linear regression (generalized) linear mixed models ordinal (mixed) models parametric (Weibull) time-to-event models Cox proportional hazards models 35

Joint Analysis and Imputation in Sequential Factorization is implemented in the package JointAI Bayesian analysis of incomplete data using (generalized) linear regression (generalized) linear mixed models ordinal (mixed) models parametric (Weibull) time-to-event models Cox proportional hazards models on CRAN: https://cranr-projectorg/package=jointai GitHub: https://githubcom/nerler/jointai website: https://nerlergithubio/jointai/ 35

Joint Analysis and Imputation in standard regression mixed model type outcome covariate outcome covariate normal lognormal (soon) (soon) Gamma beta (soon) (soon) (soon) binomial poisson (soon) ordinal multinomial (soon) (soon) (soon) Available soon: Joint models (of longitudinal & time-to-event data) Multivariate models 36

JointAI: How does it work? Requirements: (https://cranr-projectorg/) JAGS (Just Another Gibbs Sampler; https://sourceforgenet/projects/mcmc-jags/files/jags/4x/) 37

JointAI: How does it work? Requirements: (https://cranr-projectorg/) JAGS (Just Another Gibbs Sampler; https://sourceforgenet/projects/mcmc-jags/files/jags/4x/) Installation: installpackages("jointai") 37

JointAI: How does it work? Requirements: (https://cranr-projectorg/) JAGS (Just Another Gibbs Sampler; https://sourceforgenet/projects/mcmc-jags/files/jags/4x/) Installation: installpackages("jointai") Usage: library("jointai") res <- lm_imp(sbp ~ age + gender + smoke + occup, data = NHANES, niter = 300) 37

JointAI: How does it work? traceplot(res) 38

JointAI: How does it work? summary(res) ## ## Linear model fitted with JointAI ## ## Call: ## lm_imp(formula = SBP ~ age + gender + smoke + occup, data = NHANES, ## niter = 300) ## ## Posterior summary: ## Mean SD 25% 975% tail-prob GR-crit ## (Intercept) 106222 33979 99461 112961 00000 100 ## age 0427 00798 0278 0583 00000 100 ## genderfemale -7450 22718-11755 -3072 00000 100 ## smokeformer -6692 30297-12342 -0885 00267 103 ## smokecurrent -2658 30229-8450 3313 03711 101 ## occuplooking for work 3817 64037-9487 16087 05044 101 ## occupnot working -0869 26858-6110 4256 07511 102 ## ## Posterior summary of residual std deviation: ## Mean SD 25% 975% GR-crit ## sigma_sbp 143 0753 128 158 0999 ## ## ## MCMC settings ## [] 39

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality 40

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality extension to MNAR using pattern mixture model 40

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality extension to MNAR using pattern mixture model non-parametric Bayesian methods 40

What is left to do? (Imputation) methods make certain assumptions, eg: missingness is M(C)AR the incomplete variable has a certain conditional distribution all associations are linear compatibility and congeniality extension to MNAR using pattern mixture model non-parametric Bayesian methods semi-parametric methods 40

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias 41

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias good use of (imputation) methods requires knowledge of the data knowledge of the methods knowledge of the software time & patience! 41

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias good use of (imputation) methods requires knowledge of the data knowledge of the methods knowledge of the software time & patience! JointAI aims to facilitate correct handling of missing values by assuring compatibility & congeniality simultaneous analysis & imputation especially for complex settings 41

Take-Home Message handling missing values correctly: not that easy all methods have assumptions violation bias good use of (imputation) methods requires knowledge of the data knowledge of the methods knowledge of the software time & patience! JointAI aims to facilitate correct handling of missing values by assuring compatibility & congeniality simultaneous analysis & imputation especially for complex settings There is no magical solution that will always work in all settings 41

Thank you for your attention nerler@erasmusmcnl N_Erler NErler wwwnerlercom