How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment
The study compares three approaches to handling missing data in categorical variables: complete case analysis (CCA), multiple imputation (MI, based on random forest), and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism (missing completely at random, MCAR; missing at random, MAR; missing not at random, MNAR), the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may yield either nearly unbiased or dramatically biased estimates. The choice of approach should be driven primarily by the missingness mechanism: CCA under MCAR, MI under MAR, and, again, CCA under MNAR. Although MIM also produces nearly unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, with inflated standard errors and, consequently, incorrect p-values.
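A minimal sketch of the comparison described above, under assumptions of my own (a single binary predictor, a 30% MCAR pattern, and a true slope of 2.0; this is not the paper's actual simulation design):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulate a binary categorical predictor and an outcome (true slope = 2.0).
x = rng.integers(0, 2, n).astype(float)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# MCAR: 30% of x values go missing, independently of x and y.
miss = rng.random(n) < 0.3

def ols_coefs(X, y):
    """Coefficients of y ~ [1, X] via ordinary least squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Complete case analysis: drop the rows where x is missing.
beta_cca = ols_coefs(x[~miss], y[~miss])

# Missing-indicator method: fill missing x with 0 and add an indicator column.
x_filled = np.where(miss, 0.0, x)
beta_mim = ols_coefs(np.column_stack([x_filled, miss.astype(float)]), y)

print(beta_cca[1], beta_mim[1])  # both slopes land near the true value 2.0
```

Under MCAR both slope estimates are close to 2.0, matching the abstract's claim that CCA and MIM are nearly unbiased here; the inefficiency of MIM would show up in the standard errors, which this toy script does not compute.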
The global trend toward democratization over the last decades has attracted the interest of many researchers and politicians. While a significant part of the studies of democracy indices' validity concentrates on the measurements used in broad cross-country surveys, Stein Ringen, professor at the University of Oxford, draws attention to the insufficient consideration of the systemic characteristics of democracy and suggests investigating how individual people perceive the level of democracy in their own country (without, however, offering an empirical database for this). In this study, using regression and correlation analysis, the author concludes that not all democracy indices can be called valid; in particular, the least valid is the widespread Polity IV index.
It is commonly the case in multi-modal pattern recognition that certain modality-specific object features are missing in the training set. We address here the missing data problem for kernel-based Support Vector Machines, in which each modality is represented by the respective kernel matrix over the set of training objects, such that the omission of a modality for some object manifests itself as a blank in the modality-specific kernel matrix at the relevant position. We propose to fill the blank positions in the collection of training kernel matrices via a variant of the Neutral Point Substitution (NPS) method, where the term "neutral point" stands for the locus of points defined by the "neutral hyperplane" in the hypothetical linear space produced by the respective kernel. The current method crucially differs from the previously developed neutral point approach in that it is capable of treating missing data in the training set on the same basis as missing data in the test set. It is therefore of potentially much wider applicability. We evaluate the method on the Biosecure DS2 data set.
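To make the setup concrete, here is a toy illustration of how a missing modality shows up as blanks in a kernel matrix, with the blanks filled by a crude constant substitution. This stand-in is emphatically not the authors' NPS method (which derives the neutral point from the SVM's neutral hyperplane); it only shows the shape of the problem:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy features for one modality over 5 training objects;
# object 2 is missing this modality entirely.
X = rng.normal(size=(5, 3))
missing = {2}

# Linear kernel over the training objects.
K = X @ X.T
for i in missing:
    K[i, :] = np.nan  # the whole row and column for object 2
    K[:, i] = np.nan  # become blanks in this modality's kernel matrix

# Crude substitution (NOT NPS): fill every blank with the mean of the
# observed kernel entries, a constant "neutral" value.
fill = np.nanmean(K)
K_filled = np.where(np.isnan(K), fill, K)
```

The substitution keeps the matrix symmetric and complete, so a standard kernel SVM can be trained on it; the point of NPS is to choose that fill value in a principled, hyperplane-derived way rather than as a global mean.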
This book is a collection of articles presenting the most recent cutting-edge results on the specification and estimation of economic models, written by some of the world’s foremost leaders in theoretical and methodological econometrics. Recent advances in asymptotic approximation theory, including the use of higher-order asymptotics for tasks such as estimator bias correction, and the use of various expansions and other theoretical tools to develop bootstrap techniques for inference, are at the forefront of theoretical development in econometrics. One important feature of these advances is that they are being seamlessly and almost immediately incorporated into the “empirical toolbox” that applied practitioners use when actually constructing models from data, both for prediction and for policy analysis; the more theoretically oriented chapters in the book discuss these developments. Turning to empirical methodology, the chapters on prediction methodology focus on macroeconomic and financial applications, such as the construction of diffusion index models for forecasting with very large numbers of variables, and the construction of data samples that yield optimal predictive accuracy tests when comparing alternative prediction models. The chapters carefully outline how applied practitioners can correctly implement the latest theoretical refinements in model specification in order to “build” the best models using large-scale and traditional datasets, making the book of interest to a broad readership of economists, from theoretical econometricians to applied practitioners.
The authors analyzed the quality of life of the population in several regions of the Russian Federation using multivariate statistical analysis. They found that improving the population's quality of life, in particular increasing life expectancy, can be achieved by adjusting demographic indicators, cash income, health care development, and social and environmental security in the Volga Federal District. In the municipalities of the Republic of Mari El, growth in employment, wages, migration and natural population growth, the number of doctors, and the number of commissioned houses, together with a reduced share of dilapidated housing and reduced mortality, improved the quality of life and increased fertility.
This book concentrates on in-depth explanation of a few methods addressing core issues, rather than on presenting a multitude of currently popular methods. An added value of this edition is that I try to address two features of the brave new world that materialized after the first edition was written in 2010. These features are the emergence of “Data science” and changes in student cognitive skills in the process of global digitalization. The birth of Data science gives me more opportunities in delineating the field of data analysis. An overwhelming majority of both theoreticians and practitioners are inclined to consider the notions of “data analysis” (DA) and “machine learning” (ML) as synonymous. There are, however, at least two differences between the two. First comes the difference in perspectives. ML is to equip computers with methods and rules to see through regularities of the environment and behave accordingly. DA is to enhance conceptual understanding. Indeed, these goals are not inconsistent, which explains the huge overlap between DA and ML. However, there are situations in which the perspectives diverge. Regarding current students’ cognitive habits, I came to the conclusion that they prefer to get immediately into the “thick of it”. Therefore, I streamlined the presentation of multidimensional methods. These methods are now organized in four Chapters, one of which presents correlation learning (Chapter 3). The three other Chapters present summarization methods, both quantitative (Chapter 2) and categorical (Chapters 4 and 5). Chapter 4 relates to finding and characterizing partitions by using K-means clustering and its extensions. Chapter 5 relates to hierarchical and separative cluster structures.
Using the encoder-decoder data-recovery approach brings forth a number of mathematically proven interrelations between methods used for addressing such practical issues as the analysis of mixed-scale data, data standardization, the number of clusters, cluster interpretation, etc. An obvious bias towards summarization over correlation can be explained, first, by the fact that most texts in the field are biased in the opposite direction and, second, by my personal preferences. Categorical summarization, that is, clustering, is considered not just a method of DA but rather a model of classification as a concept in knowledge engineering. Also, in this edition, I somewhat relaxed the “presentation/formulation/computation” narrative structure, which was omnipresent in the first edition, to be able to do things in one go. Chapter 1 presents the author’s view on the DA mainstream, or core, as well as on a few Data science issues in general. Specifically, I bring forward novel material on the role of DA, including its successes and pitfalls (Section 1.4), and on classification as a special form of knowledge (Section 1.5). Overall, my goal is to show the reader that Data science is not yet a well-formed part of knowledge but rather a piece of science-in-the-making.
The article explores the procedural aspects of constructing structural-logical typologies with the aim of creating an innovation index: workers' attitudes guiding innovation and innovation-related behavior in the workplace.
This paper presents a preliminary analysis of hotel room prices in several European cities based on data from the Booking.com website. The main question raised in the study is whether early booking is indeed advantageous and, if so, how early it should be. First, a script was developed to download more than 600 thousand hotel offers for reservations from 25 March 2013 to 17 March 2014. Then an attempt to uncover more details about the early-booking effect was made via basic statistics, graphical data representation, and hedonic pricing analysis. It was revealed that making reservations in advance can indeed be profitable, although more data and research are needed to quantify the effect precisely, as it depends at least on seasonality and city.
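A hedonic pricing analysis of this kind typically regresses log price on booking lead time plus controls. The sketch below uses entirely made-up data and coefficients (a hypothetical 0.1% discount per day booked in advance and a two-city dummy), not the study's scraped Booking.com sample, just to show the shape of such a regression:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical data: booking lead time in days and a city dummy.
lead = rng.integers(0, 180, n).astype(float)
city = rng.integers(0, 2, n).astype(float)  # e.g. 0 = city A, 1 = city B

# Assumed data-generating process: prices fall ~0.1% per extra day
# booked in advance (an invented number, not the paper's estimate).
log_price = 4.5 - 0.001 * lead + 0.2 * city + rng.normal(0, 0.1, n)

# Hedonic regression: log price on lead time with a city control.
A = np.column_stack([np.ones(n), lead, city])
beta, *_ = np.linalg.lstsq(A, log_price, rcond=None)
print(f"early-booking effect: {beta[1] * 100:.3f}% per day")
```

Because the outcome is in logs, the lead-time coefficient reads directly as a percentage price change per day of advance booking; seasonality and city interactions, which the abstract flags as important, would enter as further regressors.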