How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment
This study compares three approaches to handling missing values in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice should be based primarily on the missingness mechanism: CCA under missing completely at random (MCAR), MI under missing at random (MAR), and, again, CCA under missing not at random (MNAR). Although MIM also produces almost unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, that is, ones with inflated standard errors and, consequently, incorrect p-values.
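To make two of the compared approaches concrete, the following is a minimal sketch and not the study's actual simulation design. It assumes a hypothetical three-level categorical predictor `x`, a linear outcome `y`, and a 30% MCAR missingness rate, then fits OLS under complete case analysis and under the missing-indicator method. All names, sample sizes, and coefficient values are illustrative assumptions; random-forest multiple imputation is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: a 3-level categorical predictor x and a linear outcome y.
x = rng.integers(0, 3, size=n)  # levels 0, 1, 2; level 0 is the reference
y = 1.0 + 0.5 * (x == 1) + 1.0 * (x == 2) + rng.normal(size=n)

# MCAR (assumed here): every value of x has the same 30% chance of being missing.
missing = rng.random(n) < 0.3


def dummies(levels, keep):
    """Design matrix: intercept plus one dummy column per level in `keep`."""
    cols = [np.ones(len(levels))]
    cols += [(levels == k).astype(float) for k in keep]
    return np.column_stack(cols)


def ols(design, outcome):
    """OLS coefficients via least squares."""
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


# Complete case analysis: drop the rows where x is missing.
observed = ~missing
beta_cca = ols(dummies(x[observed], keep=[1, 2]), y[observed])

# Missing-indicator method: recode missing x as an extra level (-1), keep all rows.
x_mim = np.where(missing, -1, x)
beta_mim = ols(dummies(x_mim, keep=[1, 2, -1]), y)

print("CCA [intercept, level 1, level 2]:", beta_cca)
print("MIM [intercept, level 1, level 2, missing]:", beta_mim)
```

In this single-predictor sketch the model is saturated, so both approaches return the same coefficients for the observed levels. Differences between them can only show up once further covariates enter the model or the missingness depends on other variables.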