The possibility of working with missing data when using CHAID: results of a statistical experiment
The paper addresses an approach to working with missing data "as is", that is, treating a missing value as one more category of the variable under study. This approach differs radically from the alternatives, which either delete the observations that contain missing values or replace missing values with valid data. The only method known to us that implements the "as is" approach is CHAID. CHAID belongs to the decision tree class of methods; in itself, it is interesting and relevant for researchers dealing with categorical variables and nonlinear associations.
In the literature, we did not find an answer to the question of what advantages and limitations the "as is" approach implemented in CHAID has compared with the alternative approaches mentioned above. Nevertheless, tree models with missing data are often found in empirical studies. To start a discussion of this issue, we conducted several series of statistical experiments on generated data organized into three predictors measured on categorical and interval scales. It was empirically established that, on the whole, the method distributes missing values across the tree's nodes correctly, but in most cases including missing values in the analysis is accompanied by changes in the tree's structure, and therefore there is a risk of drawing incorrect conclusions. The paper also provides recommendations on which factors should be considered when deciding whether to include missing values in an analysis "as is".
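The contrast between the "as is" approach and listwise deletion can be illustrated with a minimal sketch. It is not the authors' experiment: the toy data, the `MISSING` label, and the helper names `contingency` and `chi_square` are assumptions made for illustration, and the code computes only the Pearson chi-square statistic that CHAID-style splitting relies on, not a full tree.

```python
from collections import Counter

MISSING = "missing"  # hypothetical label for the extra category


def contingency(predictor, target, keep_missing=True):
    """Cross-tabulate predictor categories against target classes.

    With keep_missing=True, None becomes its own category ("as is").
    With keep_missing=False, observations with None are dropped.
    """
    table = Counter()
    for x, y in zip(predictor, target):
        if x is None:
            if not keep_missing:
                continue  # listwise deletion
            x = MISSING
        table[(x, y)] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    n = sum(table.values())
    row_tot = {r: sum(table[(r, c)] for c in cols) for r in rows}
    col_tot = {c: sum(table[(r, c)] for r in rows) for c in cols}
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            stat += (table[(r, c)] - expected) ** 2 / expected
    return stat


# Toy data (illustrative only): four "a", four "b", four missing values.
predictor = ["a"] * 4 + ["b"] * 4 + [None] * 4
target = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]

stat_as_is = chi_square(contingency(predictor, target, keep_missing=True))
stat_deleted = chi_square(contingency(predictor, target, keep_missing=False))
print(stat_as_is, stat_deleted)
```

In this toy case the two statistics differ, which shows in miniature how including the missing category can change which split a tree prefers.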
In this study, a CHAID-based approach to detecting heterogeneity of classification accuracy across segments of observations is proposed. It helps to solve two important problems facing a model builder: (1) how to automatically detect segments in which the model significantly underperforms, and (2) how to use knowledge about the heterogeneity of classification accuracy across segments to partition observations and achieve better predictive accuracy. The approach was applied to churn data from the UCI Repository of Machine Learning Databases. By splitting the data set into four parts based on the decision tree and building a separate logistic regression scoring model for each segment, we increased accuracy by more than 7 percentage points on the test sample. A significant increase in recall and precision was also observed. It was shown that different segments may have entirely different churn predictors. Such partitioning therefore gives better insight into the factors influencing customer behavior.
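The idea of detecting segments where a classifier underperforms can be sketched as follows. This is only an illustration under stated assumptions: the segment labels are taken as given (in the study they come from a CHAID tree), the fixed `margin` threshold stands in for the study's significance-based detection, and the data and function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(segments, y_true, y_pred):
    """Return the share of correct predictions within each segment."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seg, truth, pred in zip(segments, y_true, y_pred):
        totals[seg] += 1
        hits[seg] += int(truth == pred)
    return {seg: hits[seg] / totals[seg] for seg in totals}


def underperforming(acc_by_seg, overall, margin=0.1):
    """List segments whose accuracy is below overall accuracy minus margin.

    The fixed margin is an assumption of this sketch, not the study's test.
    """
    return sorted(seg for seg, acc in acc_by_seg.items() if acc < overall - margin)


# Toy data (illustrative only): two segments of four observations each.
segments = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

per_segment = accuracy_by_segment(segments, y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weak_segments = underperforming(per_segment, overall)
print(per_segment, overall, weak_segments)
```

Segments flagged this way are natural candidates for a separate scoring model, which is the partitioning step the abstract describes.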
It is commonly the case in multi-modal pattern recognition that certain modality-specific object features are missing in the training set. We address here the missing data problem for kernel-based Support Vector Machines, in which each modality is represented by the respective kernel matrix over the set of training objects, such that the omission of a modality for some object manifests itself as a blank in the modality-specific kernel matrix at the relevant position. We propose to fill the blank positions in the collection of training kernel matrices via a variant of the Neutral Point Substitution (NPS) method, where the term "neutral point" stands for the locus of points defined by the "neutral hyperplane" in the hypothetical linear space produced by the respective kernel. The current method crucially differs from the previously developed neutral point approach in that it is capable of treating missing data in the training set on the same basis as missing data in the test set. It is therefore of potentially much wider applicability. We evaluate the method on the Biosecure DS2 data set.
This research is dedicated to the design of a decision support system for the categorization of scientific literature. The purpose of the work is to explore ways of applying machine learning algorithms to automate manual text categorization. The following stages are considered: preprocessing of raw data, word embedding, model selection, the classification model, and software design. At the first stage, a training set of 200,000 Russian texts was formed in collaboration with VINITI RAS. At the second stage, the choice of the embedding model was justified: a Word2Vec representation in which the text matrix is reduced to a vector by “sum” convolution, with dimensionality 1500. At the third stage, the quality of the classifiers was estimated, and the logistic regression algorithm with the highest F1 score (0.94) was selected. At the final stage, the ATC (Automatic Text Classifier) application, which embeds the results obtained at the previous stages, was developed. The overall application structure is described. It consists of compact program modules that can be replaced or adapted to the incoming text to achieve the best classification score.
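One plausible reading of the “sum” convolution is that the word vectors of a text are added element-wise to give a single document vector. The sketch below illustrates that reading only; the toy three-dimensional embeddings and the name `document_vector` are hypothetical, and a real system would use trained Word2Vec vectors of dimensionality 1500.

```python
def document_vector(tokens, embeddings, dim):
    """Sum the word vectors of known tokens into one document vector.

    Tokens without an embedding are skipped (an assumption of this sketch).
    """
    vec = [0.0] * dim
    for tok in tokens:
        word_vec = embeddings.get(tok)
        if word_vec is None:
            continue  # out-of-vocabulary token
        for i, value in enumerate(word_vec):
            vec[i] += value
    return vec


# Toy embeddings (illustrative only), dimensionality 3 instead of 1500.
toy_embeddings = {
    "decision": [1.0, 0.0, 0.5],
    "tree": [0.0, 2.0, 0.5],
}

doc = document_vector(["decision", "tree", "unknown"], toy_embeddings, dim=3)
print(doc)
```

The resulting fixed-length vector can then be passed to any classifier, such as the logistic regression mentioned in the abstract.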
The article presents a model for optimizing the inventory control strategy under risk in the supply chains of meat industry enterprises. It studies an approach to transforming a model under conditions of uncertainty into a risk management model by using the decision tree method. For the corresponding model under risk, the decision tree method is used to determine the optimal strategy, which allows for different attitudes to risk.
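How the optimal strategy in a decision tree can depend on the attitude to risk may be sketched as follows. The abstract does not state the article's criterion, so the mean minus `risk_aversion` times standard deviation score used here is an assumption, as are the strategy names and all payoff numbers.

```python
import math


def expected_value(outcomes):
    """Expected payoff of a chance node given (probability, payoff) pairs."""
    return sum(p * x for p, x in outcomes)


def std_dev(outcomes):
    """Standard deviation of the payoff at a chance node."""
    mean = expected_value(outcomes)
    return math.sqrt(sum(p * (x - mean) ** 2 for p, x in outcomes))


def best_strategy(tree, risk_aversion=0.0):
    """Pick the strategy with the highest risk-adjusted score.

    risk_aversion=0 gives the risk-neutral expected value criterion;
    larger values penalize payoff variability (assumed criterion).
    """
    def score(name):
        outcomes = tree[name]
        return expected_value(outcomes) - risk_aversion * std_dev(outcomes)

    return max(tree, key=score)


# Hypothetical decision tree: each strategy leads to one chance node.
inventory_tree = {
    "large_stock": [(0.5, 100.0), (0.5, 0.0)],
    "small_stock": [(1.0, 40.0)],
}

neutral_choice = best_strategy(inventory_tree, risk_aversion=0.0)
averse_choice = best_strategy(inventory_tree, risk_aversion=0.5)
print(neutral_choice, averse_choice)
```

With these toy numbers a risk-neutral decision maker and a risk-averse one choose different strategies, which is the kind of dependence on risk attitude the abstract refers to.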
The article discusses the influence of temperament on the academic performance of first-year students at HSE-Nizhny Novgorod, using the Faculty of Informatics, Mathematics and Computer Science (IM&CS) as an example. The analysis was carried out using statistics and educational data mining. The baseline data for the study are survey information about students: temperament, degree of extraversion, stability, and other personality traits. The study involved first- and second-year students of the Faculty of IM&CS in the 2017-2018 academic year. Psychological factors affecting the average score and the probability of re-training for students with different temperaments were then identified. A connection between temperament and academic success was found, which makes it possible to predict "risky" students. The machine learning methods used were kNN and decision trees. Decision trees showed the best results. As a result, first-year students are classified into three groups (Good, Medium, Bad) according to the degree of risk of incurring academic debt. The practical outcome of the research was a set of recommendations to the educational office of the Faculty of IM&CS to pay attention to risky students and assist them in the educational process. After the end of the summer session, the classification results were checked. The article also presents an algorithm for finding risky students that takes temperament into account.
Several approaches to the concept of fatherhood present in the Western sociological tradition are analyzed and compared: biological determinism, social constructivism and biosocial theory. The topic of fatherhood and men’s parental practices is marginalized in modern Russian social research on the family, and this reinforces the traditional inequality in family relations, in which the father’s role is considered secondary to that of the mother. In Western critical men’s studies, however, several stages can be outlined: the development of the “sex roles” paradigm (biological determinism), the emergence of the hegemonic masculinity concept, and the interdisciplinary stage (biosocial theory). According to biological determinism, the father’s role is that of the patriarch: he continues the family line and serves as a model for his descendants. Social constructivism examines a man’s functions in the family from the point of view of masculine pressure and the establishment of hegemony over the woman and children. Biosocial theory aims to unite the biological determinacy of fatherhood with its social, cultural and personal context. It is shown that these approaches are directly connected with the level of societal development, perceptions of marriage and family, and the degree of equality in the gender order.