Machine Learning Methods in Sociological Research: Predicting Item Nonresponse Using the Naive Bayes Classifier
Missing data in social research can arise for various reasons. This article focuses on item nonresponse errors that occur when respondents lack the knowledge, are unwilling, or find it difficult to answer specific questionnaire questions. Predicting item nonresponse is of particular interest because it would help reduce missing data. Using European Social Survey (ESS) data for UK respondents, the article shows how text mining and machine learning can be used to predict item nonresponse. The study employs the Naive Bayes classifier, a popular method for predicting the class of a dependent variable from textual data, and draws on the scientific literature to describe how the method works. The article presents a database that combines the full wordings of questions, answers, and instructions with the ESS survey results for the UK. It shows how separate models for predicting the occurrence of item nonresponse were trained with the Naive Bayes technique on word frequencies and on TF-IDF weights (the calculation of which is also described). Each model was evaluated by how often it made errors. As a result, lists of words that do and do not lead to item nonresponse errors were obtained. The results show that respondents are less likely to answer sensitive questions, and that certain words related to the procedure for arriving at an answer can also lead to high levels of item nonresponse.
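To make the general approach concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier trained once on raw word frequencies and once on TF-IDF weights. It is not the article's implementation: the four question wordings and their "high"/"low" nonresponse labels are invented for illustration and are not ESS data, the function names are hypothetical, and the smoothed IDF formula and the add-one (Laplace) smoothing are assumed choices that the abstract does not specify.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplest possible tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def count_weights(docs):
    # Word-frequency features: one {term: count} dict per document.
    return [dict(Counter(doc)) for doc in docs]


def tfidf_weights(docs):
    # TF-IDF features: term count times a smoothed inverse document frequency.
    # The smoothing used here is one common variant (an assumption).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def train_nb(features, labels, alpha=1.0):
    # Multinomial Naive Bayes with additive smoothing (alpha=1 is Laplace).
    vocab = {t for f in features for t in f}
    class_total = defaultdict(float)
    term_total = defaultdict(lambda: defaultdict(float))
    for f, y in zip(features, labels):
        for t, w in f.items():
            term_total[y][t] += w
            class_total[y] += w
    model = {}
    for y, n_y in Counter(labels).items():
        denom = class_total[y] + alpha * len(vocab)
        model[y] = {
            "log_prior": math.log(n_y / len(labels)),
            "log_lik": {
                t: math.log((term_total[y].get(t, 0.0) + alpha) / denom)
                for t in vocab
            },
        }
    return model


def predict(model, text):
    # Pick the class with the highest log posterior; unseen words are skipped.
    scores = {}
    for y, params in model.items():
        scores[y] = params["log_prior"] + sum(
            params["log_lik"][t] for t in tokenize(text) if t in params["log_lik"]
        )
    return max(scores, key=scores.get)


def log_ratio(model, term, cls="high", other="low"):
    # Positive values mean the word is more typical of `cls` than of `other`.
    return model[cls]["log_lik"][term] - model[other]["log_lik"][term]


# Hypothetical question wordings and nonresponse labels (NOT ESS data).
questions = [
    "what is your household total net income",
    "how much do you earn per month",
    "how often do you watch television",
    "how often do you meet friends",
]
labels = ["high", "high", "low", "low"]

docs = [tokenize(q) for q in questions]
count_model = train_nb(count_weights(docs), labels)
tfidf_model = train_nb(tfidf_weights(docs), labels)

print(predict(count_model, "what is your total income"))
print(predict(tfidf_model, "how often do you visit friends"))
```

Ranking the vocabulary by `log_ratio` is one simple way to obtain lists of words associated with each class, similar in spirit to the word lists the abstract describes; with a real corpus the models would also need a held-out set to estimate how often they make errors.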