How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment
The study compares three approaches to handling missing data in categorical variables: complete case analysis (CCA), multiple imputation (MI, based on random forest), and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism (missing completely at random, MCAR; missing at random, MAR; missing not at random, MNAR), the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may yield either nearly unbiased or dramatically biased estimates. The choice of approach should be driven primarily by the missingness mechanism: CCA under MCAR, MI under MAR, and, again, CCA under MNAR. Although MIM also produces nearly unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, with inflated standard errors and, consequently, incorrect p-values.
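A minimal sketch of the comparison described above, under assumptions of my own (a single binary predictor, a 30% MCAR pattern, and a true slope of 2.0; this is not the paper's actual simulation design):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulate a binary categorical predictor and an outcome (true slope = 2.0).
x = rng.integers(0, 2, n).astype(float)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# MCAR: 30% of x values go missing, independently of x and y.
miss = rng.random(n) < 0.3

def ols_coefs(X, y):
    """Coefficients of y ~ [1, X] via ordinary least squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Complete case analysis: drop the rows where x is missing.
beta_cca = ols_coefs(x[~miss], y[~miss])

# Missing-indicator method: fill missing x with 0 and add an indicator column.
x_filled = np.where(miss, 0.0, x)
beta_mim = ols_coefs(np.column_stack([x_filled, miss.astype(float)]), y)

print(beta_cca[1], beta_mim[1])  # both slopes land near the true value 2.0
```

Under MCAR both slope estimates are close to 2.0, matching the abstract's claim that CCA and MIM are nearly unbiased here; the inefficiency of MIM would show up in the standard errors, which this toy script does not compute.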
The global trend toward democratization over the last decades has attracted the interest of many researchers and politicians. While a significant part of the studies of democracy indices' validity concentrates on the measurements used in broad cross-country surveys, Stein Ringen, professor at the University of Oxford, draws attention to the insufficient consideration of the systemic characteristics of democracy and suggests investigating how individual people perceive the level of democracy in their own country (without, however, offering an empirical database for this). In this study, using regression and correlation analysis, the author concludes that not all democracy indices can be called valid; in particular, the least valid is the widespread Polity IV index.
It is commonly the case in multi-modal pattern recognition that certain modality-specific object features are missing in the training set. We address here the missing data problem for kernel-based Support Vector Machines, in which each modality is represented by the respective kernel matrix over the set of training objects, such that the omission of a modality for some object manifests itself as a blank in the modality-specific kernel matrix at the relevant position. We propose to fill the blank positions in the collection of training kernel matrices via a variant of the Neutral Point Substitution (NPS) method, where the term "neutral point" stands for the locus of points defined by the "neutral hyperplane" in the hypothetical linear space produced by the respective kernel. The current method crucially differs from the previously developed neutral point approach in that it is capable of treating missing data in the training set on the same basis as missing data in the test set. It is therefore of potentially much wider applicability. We evaluate the method on the Biosecure DS2 data set.
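To make the setup concrete, here is a toy illustration of how a missing modality shows up as blanks in a kernel matrix, with the blanks filled by a crude constant substitution. This stand-in is emphatically not the authors' NPS method (which derives the neutral point from the SVM's neutral hyperplane); it only shows the shape of the problem:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy features for one modality over 5 training objects;
# object 2 is missing this modality entirely.
X = rng.normal(size=(5, 3))
missing = {2}

# Linear kernel over the training objects.
K = X @ X.T
for i in missing:
    K[i, :] = np.nan  # the whole row and column for object 2
    K[:, i] = np.nan  # become blanks in this modality's kernel matrix

# Crude substitution (NOT NPS): fill every blank with the mean of the
# observed kernel entries, a constant "neutral" value.
fill = np.nanmean(K)
K_filled = np.where(np.isnan(K), fill, K)
```

The substitution keeps the matrix symmetric and complete, so a standard kernel SVM can be trained on it; the point of NPS is to choose that fill value in a principled, hyperplane-derived way rather than as a global mean.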
This book is a collection of articles presenting the most recent cutting-edge results on the specification and estimation of economic models, written by some of the world’s foremost leaders in theoretical and methodological econometrics. Recent advances in asymptotic approximation theory, including the use of higher-order asymptotics for tasks such as estimator bias correction, and the use of various expansions and other theoretical tools to develop bootstrap techniques for inference, are at the forefront of theoretical development in econometrics. One important feature of these advances is that they are being seamlessly and almost immediately incorporated into the “empirical toolbox” that applied practitioners use when actually constructing models from data, both for prediction and for policy analysis; the more theoretically oriented chapters in the book discuss these developments. Turning to empirical methodology, the chapters on prediction methodology focus on macroeconomic and financial applications, such as the construction of diffusion index models for forecasting with very large numbers of variables, and the construction of data samples that yield optimal predictive accuracy tests when comparing alternative prediction models. The chapters carefully outline how applied practitioners can correctly implement the latest theoretical refinements in model specification in order to “build” the best models using large-scale and traditional datasets, making the book of interest to a broad readership of economists, from theoretical econometricians to applied practitioners.
The authors analyzed the quality of life of the population in several regions of the Russian Federation using multivariate statistical analysis. They found that improving the population's quality of life, in particular increasing life expectancy, can be achieved by adjusting demographic indicators, cash income, health care development, and social and environmental security in the Volga Federal District. In the municipalities of the Republic of Mari El, growth in employment, wages, migration and natural population growth, the number of doctors, and the number of commissioned houses, together with a reduced share of dilapidated housing and reduced mortality, improved the quality of life and increased fertility.
This book concentrates on in-depth explanation of a few methods addressing core issues, rather than on presenting a multitude of currently popular methods. An added value of this edition is that I try to address two features of the brave new world that materialized after the first edition was written in 2010. These features are the emergence of “Data science” and changes in student cognitive skills in the process of global digitalization. The birth of Data science gives me more opportunities in delineating the field of data analysis. An overwhelming majority of both theoreticians and practitioners are inclined to consider the notions of “data analysis” (DA) and “machine learning” (ML) as synonymous. There are, however, at least two differences between the two. First comes the difference in perspectives. ML is to equip computers with methods and rules to see through regularities of the environment and behave accordingly. DA is to enhance conceptual understanding. Indeed, these goals are not inconsistent, which explains the huge overlap between DA and ML. However, there are situations in which the perspectives diverge. Regarding current students’ cognitive habits, I came to the conclusion that they prefer to get immediately into the “thick of it”. Therefore, I streamlined the presentation of multidimensional methods. These methods are now organized in four Chapters, one of which presents correlation learning (Chapter 3). The three other Chapters present summarization methods, both quantitative (Chapter 2) and categorical (Chapters 4 and 5). Chapter 4 relates to finding and characterizing partitions by using K-means clustering and its extensions. Chapter 5 relates to hierarchical and separative cluster structures.
Using the encoder-decoder data-recovery approach brings forth a number of mathematically proven interrelations between methods used for addressing such practical issues as the analysis of mixed-scale data, data standardization, the number of clusters, cluster interpretation, etc. An obvious bias towards summarization over correlation can be explained, first, by the fact that most texts in the field are biased in the opposite direction and, second, by my personal preferences. Categorical summarization, that is, clustering, is considered not just a method of DA but rather a model of classification as a concept in knowledge engineering. Also, in this edition, I somewhat relaxed the “presentation/formulation/computation” narrative structure, which was omnipresent in the first edition, to be able to do things in one go. Chapter 1 presents the author’s view on the DA mainstream, or core, as well as on a few Data science issues in general. Specifically, I bring forward novel material on the role of DA, including its successes and pitfalls (Section 1.4), and on classification as a special form of knowledge (Section 1.5). Overall, my goal is to show the reader that Data science is not yet a well-formed part of knowledge but rather a piece of science-in-the-making.
The article explores the procedural aspects of constructing structural-logical typologies with the aim of creating an innovation index: workers' attitudes guiding innovation and innovation-related behavior in the workplace.
This paper presents a preliminary analysis of hotel room prices in several European cities based on data from the Booking.com website. The main question raised in the study is whether early booking is indeed advantageous and, if so, how early it should be. First, a script was developed to download more than 600 thousand hotel offers for reservations from 25 March 2013 to 17 March 2014. Then an attempt to uncover more details about the early-booking effect was made via basic statistics, graphical data representation, and hedonic pricing analysis. It was revealed that making reservations in advance can indeed be profitable, although more data and research are needed to quantify the effect precisely, as it depends at least on seasonality and city.
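A hedonic pricing analysis of this kind typically regresses log price on booking lead time plus controls. The sketch below uses entirely made-up data and coefficients (a hypothetical 0.1% discount per day booked in advance and a two-city dummy), not the study's scraped Booking.com sample, just to show the shape of such a regression:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical data: booking lead time in days and a city dummy.
lead = rng.integers(0, 180, n).astype(float)
city = rng.integers(0, 2, n).astype(float)  # e.g. 0 = city A, 1 = city B

# Assumed data-generating process: prices fall ~0.1% per extra day
# booked in advance (an invented number, not the paper's estimate).
log_price = 4.5 - 0.001 * lead + 0.2 * city + rng.normal(0, 0.1, n)

# Hedonic regression: log price on lead time with a city control.
A = np.column_stack([np.ones(n), lead, city])
beta, *_ = np.linalg.lstsq(A, log_price, rcond=None)
print(f"early-booking effect: {beta[1] * 100:.3f}% per day")
```

Because the outcome is in logs, the lead-time coefficient reads directly as a percentage price change per day of advance booking; seasonality and city interactions, which the abstract flags as important, would enter as further regressors.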