The paper describes a study that investigates which data imputation algorithm is most efficient for several methods of data analysis: regression modeling, factor analysis, descriptive statistics, and correlation analysis. Because there are no established recommendations for choosing a data imputation algorithm, the choice is ambiguous in each particular situation.
The authors argue that the data imputation algorithm should be selected according to the analysis method that will be applied to the data after imputation. In other words, the efficiency of a given data imputation algorithm is expected to differ from one data analysis method to another. A statistical experiment was used to evaluate the efficiency of several data imputation algorithms for each method of data analysis.
The core idea of the statistical experiment was to compare the results of applying each method to the reference data set (without missing values) with the results obtained on a large number of artificial subsamples generated from the original data set, in which missing values were filled in by the imputation algorithms being compared.
Subsamples were generated via the bootstrap procedure, which made it possible to perform statistical evaluation and to build confidence intervals for each parameter before and after data imputation.
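The experiment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_impute`, `bootstrap_after_imputation`), the parameters (`missing_rate`, `n_boot`, `alpha`), the synthetic reference data, the assumption that values go missing completely at random, and the use of percentile intervals are all choices made here for illustration only.

```python
import numpy as np


def mean_impute(x):
    """Fill NaNs with the mean of the observed values (one central-tendency variant)."""
    filled = x.copy()
    filled[np.isnan(filled)] = np.nanmean(x)
    return filled


def bootstrap_after_imputation(reference, impute, statistic,
                               missing_rate=0.2, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` computed on imputed resamples."""
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    n = len(reference)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample with replacement from the complete reference data (this makes a copy).
        sample = reference[rng.integers(0, n, size=n)]
        # Assumption: values go missing completely at random.
        mask = rng.random(n) < missing_rate
        mask[rng.integers(0, n)] = False  # keep at least one observed value
        sample[mask] = np.nan
        # Impute, then apply the analysis method (here a single statistic).
        stats[b] = statistic(impute(sample))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    reference = rng.normal(loc=50.0, scale=10.0, size=300)  # synthetic stand-in data
    for name, stat in [("mean", np.mean), ("std", lambda v: np.std(v, ddof=1))]:
        lo, hi = bootstrap_after_imputation(reference, mean_impute, stat)
        print(f"{name}: reference={stat(reference):.2f}, interval=({lo:.2f}, {hi:.2f})")
```

Comparing the reference value of each statistic with the interval obtained after imputation indicates how far a given algorithm distorts that statistic. Mean imputation, for example, typically leaves the mean close to its reference value but tends to shrink the standard deviation, which is consistent with the view that the efficiency of an algorithm depends on the analysis method.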
Through this experiment, the authors evaluated the efficiency of the following data imputation algorithms for the methods of data analysis mentioned above: imputation with measures of central tendency, the EM algorithm, imputation via a regression model, and the Hot Deck algorithm.
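To make two of the listed algorithms concrete, the following sketch shows one simple variant of each. Both are assumptions made for illustration and not necessarily the variants used in the study: `hot_deck_impute` is a random hot deck that draws donors from all observed values, and `regression_impute` fits a simple linear regression on complete cases using a single fully observed predictor. The function names are hypothetical.

```python
import numpy as np


def hot_deck_impute(x, rng=None):
    """Random hot deck: fill each missing value with a randomly drawn observed value."""
    rng = np.random.default_rng() if rng is None else rng
    filled = np.asarray(x, dtype=float).copy()
    missing = np.isnan(filled)
    donors = filled[~missing]  # observed values act as donors
    filled[missing] = rng.choice(donors, size=missing.sum())
    return filled


def regression_impute(y, x):
    """Fill missing y with predictions from a linear fit y ~ x on complete cases."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)  # assumed to be fully observed
    filled = y.copy()
    missing = np.isnan(y)
    # np.polyfit returns coefficients from the highest power down: slope, then intercept.
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
    filled[missing] = intercept + slope * x[missing]
    return filled
```

The random hot deck preserves the spread of the observed values better than a constant fill but ignores relationships between variables, whereas regression imputation uses such relationships but places every imputed value exactly on the fitted line.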