The concept of randomness and the problem of missing data in sociology
The paper describes a study aimed at identifying the most efficient data imputation algorithm for several methods of data analysis: regression modeling, factor analysis, descriptive statistics, and correlation analysis. Because there are no established recommendations for choosing a data imputation algorithm, the choice is ambiguous in each research situation.
The authors argue that the data imputation algorithm should be selected according to the analysis method that will be applied after the missing values are filled in. In other words, the efficiency of the same data imputation algorithm is expected to differ from one data analysis method to another. A statistical experiment was used to evaluate the efficiency of several data imputation algorithms for each method of data analysis.
The core idea of the statistical experiment was to compare the results of each method applied to the reference data set (without missing values) with the results obtained on a large number of artificial subsamples generated from the original data set, in which missing values were filled in by the imputation algorithms under comparison.
The subsamples were generated via the bootstrap procedure, which made it possible to carry out statistical evaluation and to build confidence intervals for each parameter before and after the data imputation.
Through this experiment the authors evaluated the efficiency of the following data imputation algorithms for the methods of data analysis listed above: imputation with measures of central tendency, the EM algorithm, imputation via a regression model, and the Hot Deck algorithm.
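The design of such an experiment can be illustrated with a minimal sketch. It is not the authors' implementation. The synthetic data, the 20% missing-completely-at-random mechanism, the choice of the standard deviation as the statistic, and all function names (`impute_mean`, `impute_hot_deck`, `bootstrap_ci`) are assumptions made for illustration only. The sketch builds percentile bootstrap confidence intervals for one descriptive statistic on the reference data and on data whose gaps were filled by two of the algorithms named above.

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_mean(x):
    """Replace NaNs with the mean of the observed values."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

def impute_hot_deck(x, rng):
    """Replace each NaN with a randomly chosen observed value (random hot deck)."""
    out = x.copy()
    missing = np.isnan(out)
    donors = out[~missing]
    out[missing] = rng.choice(donors, size=missing.sum(), replace=True)
    return out

def bootstrap_ci(data, statistic, transform, rng, n_boot=500, alpha=0.05):
    """Percentile bootstrap interval for `statistic` computed after `transform`."""
    n = len(data)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample rows with replacement, then impute, then analyze.
        sample = data[rng.integers(0, n, size=n)]
        stats[b] = statistic(transform(sample))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Hypothetical reference data set without missing values.
reference = rng.normal(loc=50.0, scale=10.0, size=300)

# Assumed missingness mechanism: 20% of values removed completely at random.
with_gaps = reference.copy()
with_gaps[rng.random(300) < 0.2] = np.nan

statistic = np.std  # the descriptive statistic under study (an arbitrary choice)

ci_reference = bootstrap_ci(reference, statistic, lambda x: x, rng)
ci_mean = bootstrap_ci(with_gaps, statistic, impute_mean, rng)
ci_hot_deck = bootstrap_ci(with_gaps, statistic,
                           lambda x: impute_hot_deck(x, rng), rng)

print("reference:", ci_reference)
print("mean imputation:", ci_mean)
print("hot deck:", ci_hot_deck)
```

Comparing each interval with the reference interval shows how strongly an imputation algorithm distorts the chosen statistic. Mean imputation, for example, always reduces the standard deviation relative to the observed values, because every imputed value has zero deviation from the mean.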
Multiple imputation is an approach to handling missing data developed by Donald Rubin. Its purpose is to reconstruct the initial structure of the data, i.e. to generate answers as close as possible to those of the hypothetical complete data set. However, the original multiple imputation algorithm is complicated and takes considerable effort to carry out. In this study a simpler alternative, averaging of the imputed values, was experimentally tested against Rubin’s rule in a number of common research situations. We compared the two approaches to aggregating multiple imputation results, taking into account the analytical tool used, the share of missing values, and the type of the variable that contains missing values. The results were summed up in a set of recommendations describing the appropriate approach to aggregation for each research situation.
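The difference between the two aggregation approaches can be sketched as follows. Rubin's rule pools the m per-data-set estimates: the pooled estimate is their mean, and the total variance is T = W + (1 + 1/m)B, where W is the mean within-imputation variance and B is the between-imputation variance. Averaging instead collapses the m imputed values of each missing cell into one value and analyzes a single data set. The example is a simplified illustration, not the study's procedure. The synthetic data, the 30% missingness rate, and the helper names (`draw_imputation`, `mean_and_variance`) are assumptions. The imputation step is a stochastic regression imputation with fixed coefficients, whereas a full multiple imputation would also draw the regression parameters anew for each imputation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y depends on x; about 30% of y is missing at random.
n, m = 200, 20  # sample size and number of imputations (assumed values)
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

def draw_imputation(x, y_obs, missing, rng):
    """One stochastic regression imputation: prediction plus random noise."""
    slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)
    resid = y_obs[~missing] - (intercept + slope * x[~missing])
    sigma = resid.std(ddof=2)
    filled = y_obs.copy()
    noise = rng.normal(scale=sigma, size=missing.sum())
    filled[missing] = intercept + slope * x[missing] + noise
    return filled

def mean_and_variance(values):
    """Estimate of the mean and the squared standard error of that estimate."""
    return values.mean(), values.var(ddof=1) / len(values)

imputed = [draw_imputation(x, y_obs, missing, rng) for _ in range(m)]

# Approach 1: Rubin's rule, pooling the m separate analyses.
estimates, variances = zip(*(mean_and_variance(d) for d in imputed))
q_bar = np.mean(estimates)              # pooled point estimate
within = np.mean(variances)             # W: mean within-imputation variance
between = np.var(estimates, ddof=1)     # B: between-imputation variance
total_var = within + (1 + 1 / m) * between

# Approach 2: average the imputed values cell by cell, then analyze once.
averaged = np.mean(imputed, axis=0)
q_avg, var_avg = mean_and_variance(averaged)

print("Rubin's rule:", q_bar, total_var)
print("Averaging:   ", q_avg, var_avg)
```

For the sample mean the two point estimates coincide, because the mean is linear in the data. The variance estimates differ: averaging removes the between-imputation component and smooths the imputed values, so it reports a smaller variance than Rubin's rule. For nonlinear statistics the point estimates can differ as well.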