Approaches to aggregating the results of multiple imputation of missing values: a comparative analysis
Multiple imputation is an approach to handling missing data developed by Donald Rubin. Its purpose is to reconstruct the initial structure of the data, i.e. to generate values as close as possible to the hypothetical complete dataset. However, the original multiple imputation algorithm is complicated and demands considerable effort. In this study a simpler alternative approach, averaging of imputed values, was experimentally tested against Rubin's rule in a number of common research situations. We compared the two approaches to aggregating multiple imputation results, Rubin's rule and averaging of imputed values, with respect to the analytical tools used, the share of missing values, and the type of the variable that contains the missing values. The results are summarized in a set of recommendations describing the pertinent aggregation approach for each research situation.
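The two aggregation strategies compared in the abstract can be sketched in a few lines. This is a minimal illustration on synthetic data, not the study's setup: the imputation scheme (random draws from the observed values) is a naive stand-in, and the analyzed statistic is simply the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a normal variable with ~30% of values missing at random.
x = rng.normal(loc=10.0, scale=2.0, size=200)
miss = rng.random(200) < 0.3
observed = x[~miss]

m = 5  # number of imputations
estimates, variances, imputed_sets = [], [], []
for _ in range(m):
    # Naive stochastic imputation: draw replacements from the observed values.
    filled = x.copy()
    filled[miss] = rng.choice(observed, size=miss.sum())
    imputed_sets.append(filled)
    estimates.append(filled.mean())                     # per-imputation estimate
    variances.append(filled.var(ddof=1) / len(filled))  # its sampling variance

# Rubin's rule: pool the estimates and combine within/between variance.
q_bar = np.mean(estimates)
W = np.mean(variances)        # within-imputation variance
B = np.var(estimates, ddof=1) # between-imputation variance
T = W + (1 + 1 / m) * B       # total variance of the pooled estimate

# Averaging of imputed values: collapse to one dataset, analyze once.
averaged = np.mean(imputed_sets, axis=0)
q_avg = averaged.mean()

print(f"Rubin's rule: estimate={q_bar:.3f}, total variance={T:.5f}")
print(f"Averaging:    estimate={q_avg:.3f}")
```

For a linear statistic such as the mean the two point estimates coincide; the approaches diverge in the variance (and hence the inference), which is why Rubin's rule adds the between-imputation component B.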
It is commonly the case in multi-modal pattern recognition that certain modality-specific object features are missing in the training set. We address here the missing data problem for kernel-based Support Vector Machines, in which each modality is represented by the respective kernel matrix over the set of training objects, such that the omission of a modality for some object manifests itself as a blank in the modality-specific kernel matrix at the relevant position. We propose to fill the blank positions in the collection of training kernel matrices via a variant of the Neutral Point Substitution (NPS) method, where the term "neutral point" stands for the locus of points defined by the "neutral hyperplane" in the hypothetical linear space produced by the respective kernel. The method crucially differs from the previously developed neutral point approach in that it is capable of treating missing data in the training set on the same basis as missing data in the test set, and it is therefore of potentially much wider applicability. We evaluate the method on the Biosecure DS2 data set.
Social network data usually contain different types of errors. One of them is missing data due to actor non-response, which can seriously jeopardize the results of analyses if not appropriately treated. The impact of missing data may be more severe in valued networks, where not only the presence of a tie is recorded but also its magnitude or strength. Blockmodeling is a technique for delineating network structure; we focus on an indirect approach suitable for valued networks. Little is known about the sensitivity of valued networks to different types of measurement errors. As it is reasonable to expect that blockmodeling, with its positional outcomes, could be vulnerable to the presence of non-respondents, such errors require treatment. We examine the impacts of seven actor non-response treatments on the positions obtained when indirect blockmodeling is used. The starting point for our simulations is a set of networks whose structure is known; three structures were considered: cohesive subgroups, core-periphery, and hierarchy. The results show that the number of non-respondents, the type of underlying blockmodel structure, and the employed treatment all affect the determined partitions of actors in complex ways. Recommendations for best practices are provided. © 2016 Elsevier B.V.
Background: We present a method for reclassifying external causes of death categorized as “event of undetermined intent” (EUIs) into non-transport accidents, suicides, or homicides. In nations like Russia and the UK the absolute number of EUIs is large, the EUI death rate is high, or EUIs comprise a non-trivial proportion of all deaths due to external causes. Overuse of this category may result in (1) substantially underestimating the mortality rate of deaths due to specific external causes and (2) threats to the validity of studies of the patterns and causes of external deaths and of evaluations of the impact of interventions meant to reduce them.
Methods: We employ available characteristics of the deceased and the event to estimate the most likely cause of death using multinomial logistic regression. We use the set of known non-transport accidents, suicides, and homicides to calculate an mlogit-based linear score and an estimated classification probability (ECP). This ECP is applied to EUIs, with varying levels of minimal classification probability. We also present an optional second step that employs a population-level adjustment to reclassify deaths that remain undetermined (the proportion of which varies with the minimal classification probability). We illustrate our method by applying it to Russia. Between 2000 and 2011, 521,000 Russian deaths (15% of all deaths from external causes) were categorized as EUIs. We used anonymized micro-data on the ~3 million deaths from external causes. Our reclassification model used 10 decedent and event characteristics from the computerized death records.
Results: Results show that during this period about 14% of non-transport accidents, 13% of suicides, and 33% of homicides were officially categorized as EUIs. Our findings also suggest that 2011 levels of non-transport accidents and suicides would have been about 24% higher, and of homicide about 82% higher, than those reported by official vital statistics data.
Conclusions: Overuse of the external cause of death classification “event of undetermined intent” may indicate questionable quality of mortality data on external causes of death. This can have wide-ranging implications for families, medical professionals, the justice system, researchers, and policymakers. With our classification probability set as equal to or higher than 0.75, we were able to reclassify about two-thirds of EUI deaths in our sample. Our optional additional step allowed us to redistribute the remaining unclassified EUIs. Our method can be applied to data from any nation or sub-national population in which the EUI category is employed.
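The reclassification logic described in the Methods and Conclusions can be sketched as follows. This is a minimal illustration with synthetic data and scikit-learn's multinomial logistic regression: the features, sample sizes, and labels are invented stand-ins, not the paper's 10 decedent and event characteristics from Russian death records.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical decedent/event features for deaths with a KNOWN cause:
# 0 = non-transport accident, 1 = suicide, 2 = homicide.
n = 600
X = rng.normal(size=(n, 3))
y = (X @ rng.normal(size=(3, 3))).argmax(axis=1)  # synthetic labels with signal

model = LogisticRegression(max_iter=1000)
model.fit(X, y)  # fit on the set of known causes

# "EUI" deaths: cause officially undetermined, but features observed.
X_eui = rng.normal(size=(100, 3))
proba = model.predict_proba(X_eui)  # estimated classification probability (ECP)
ecp = proba.max(axis=1)

threshold = 0.75  # minimal classification probability, as in the abstract
reclassified = proba.argmax(axis=1)[ecp >= threshold]
remaining = int((ecp < threshold).sum())  # left for the optional second step
print(f"reclassified {len(reclassified)} of 100; {remaining} remain undetermined")
```

Deaths whose maximum class probability falls below the threshold stay undetermined; the paper's optional second, population-level step then redistributes them.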
Missing data represent an urgent problem in sociological research. One of the sources of missing data is item nonresponse, which can be related to the respondent's reluctance to answer the question, difficulties that occur during the answering process, or other reasons. The reasons for nonresponse are sought in the method of conducting the survey, in the characteristics of the respondents, and in the characteristics of the questionnaire itself. This research shows how item nonresponse can be predicted by a logistic regression model using European Social Survey (ESS) data. Models for predicting answer refusal, no answer, and the "don't know" option were trained on the textual characteristics of the questions using word frequencies and the TF-IDF word importance metric. All the models obtained were compared with each other in terms of the quality of the predictions that can be made with them; in addition, the most important words from the questions were classified according to whether they increase or decrease the likelihood of item nonresponse. In particular, it was revealed that words connected to sensitive topics lead to an increase in the proportion of item nonresponse, as do some words connected to instructions on how to answer a particular question.
The paper addresses an approach to working with missing data "as is", i.e. it is supposed that missingness becomes one more category of the variable being explored. This approach is radically different from the alternatives, which either delete the observations that contain missing values or replace missing values with valid data. The only method known to us that makes it possible to implement the "as is" approach is CHAID. CHAID belongs to the decision tree class of methods; in itself, this method is very interesting and relevant for researchers dealing with categorical variables and nonlinear associations.
In the literature, we did not find an answer to the question of what the advantages and limitations of the "as is" approach implemented in CHAID are compared to the alternative approaches mentioned above. Despite this, tree models with missing data are often found in empirical studies. To start a discussion of this issue, we conducted several series of statistical experiments on generated data organized into three predictors of categorical and interval measurement type. It was empirically established that, on the whole, the method correctly distributes missing values across the tree's nodes, but in most cases the inclusion of missing values in the analysis is accompanied by changes in the tree's structure, and therefore there is a risk of obtaining erroneous conclusions. The paper also provides recommendations on what factors should be considered when deciding whether to include missing values in an analysis "as is".
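The contrast between the "as is" approach and listwise deletion can be illustrated at the data-preparation step. CHAID itself is not shown here (it has no standard scikit-learn implementation); the pandas snippet below only demonstrates, on an invented toy variable, how missingness is turned into an explicit category before tree building, versus dropping the affected observations.

```python
import pandas as pd

# Toy survey data with missing values in one categorical variable.
df = pd.DataFrame({
    "education": ["primary", "secondary", None, "higher", None, "secondary"],
    "income_group": ["low", "mid", "mid", "high", "low", "mid"],
})

# Listwise deletion: drop every respondent with any missing value.
deleted = df.dropna()

# "As is": missingness becomes one more category of the variable,
# the way CHAID groups missing cases into the tree's nodes.
as_is = df.copy()
as_is["education"] = as_is["education"].fillna("(missing)")

print(len(deleted), as_is["education"].value_counts().to_dict())
```

The "as is" table keeps all six respondents, so any association between missingness and the outcome remains available to the tree; deletion discards it along with the observations.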
We consider certain spaces of functions on the circle which naturally appear in harmonic analysis, and superposition operators on these spaces. We study the following question: which functions have the property that each of their superpositions with a homeomorphism of the circle belongs to a given space? We also study the multidimensional case.
We consider the spaces of functions on the m-dimensional torus whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C¹-smooth phase. The results generalize to the multidimensional case the one-dimensional results obtained earlier by the author in "Quantitative estimates in the Beurling–Helson theorem", Sbornik: Mathematics, 201:12 (2010), 1811–1836.
We consider the spaces of functions on the circle whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C¹-smooth phase.
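For context, the standard notation behind these three abstracts can be sketched as follows; this is a summary of well-known definitions (the spaces $A_p$ and the classical Beurling–Helson theorem), not a restatement of the papers' results:

```latex
% The space A_p(\mathbb{T}): functions on the circle whose Fourier
% coefficients are p-summable.
\[
  A_p(\mathbb{T}) = \Bigl\{ f : \|f\|_{A_p}
    = \Bigl( \sum_{k \in \mathbb{Z}} |\widehat{f}(k)|^{p} \Bigr)^{1/p}
    < \infty \Bigr\}, \qquad 1 \le p < \infty .
\]
% The objects studied are the deformed exponentials e^{in\varphi} for a
% C^1-smooth phase \varphi. The classical Beurling--Helson theorem (the
% case p = 1) states: if \varphi is a continuous map of the circle into
% itself and \|e^{in\varphi}\|_{A_1} = O(1) as |n| \to \infty, then
% \varphi is linear, \varphi(t) = kt + t_0. The abstracts above concern
% quantitative norm estimates in this setting, including on \mathbb{T}^m.
```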