A hybrid of two novel methods - additive fuzzy spectral clustering and a lifting method over a taxonomy - is applied to analyse the research activities of a department. Specifically, we concentrate on the Computer Sciences area as represented by the ACM Computing Classification System (ACM-CCS), although the approach is applicable to other taxonomies as well. Clusters of taxonomy subjects are extracted using an original additive spectral clustering method involving a number of model-based stopping conditions. The clusters are then parsimoniously lifted to higher ranks of the taxonomy by minimizing the count of “head subjects” along with their “gaps” and “offshoots”. An example illustrating the method on real-world data is given.
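To give a flavor of the additive extraction step, here is a minimal numpy sketch of one-cluster-at-a-time spectral extraction from a subject-to-subject similarity matrix. The function name, the crude contribution-threshold stopping rule (standing in for the paper's model-based conditions), and the use of the absolute leading eigenvector as a fuzzy membership vector are our simplifications, not the paper's exact method.

```python
import numpy as np

def extract_clusters(W, min_contribution=0.05, max_clusters=10):
    """Additively extract fuzzy clusters from a symmetric similarity
    matrix W, one at a time (a simplified sketch, not the paper's
    exact method)."""
    W = W.astype(float).copy()
    total = (W ** 2).sum()                 # total data scatter
    clusters = []
    for _ in range(max_clusters):
        # Leading eigenpair of the residual similarity matrix.
        vals, vecs = np.linalg.eigh(W)
        lam, u = vals[-1], np.abs(vecs[:, -1])  # fuzzy membership scores
        if lam <= 0:
            break
        contribution = lam ** 2 / total    # share of scatter explained
        if contribution < min_contribution:  # crude stand-in for the
            break                            # model-based stopping rules
        clusters.append((u, lam, contribution))
        W -= lam * np.outer(u, u)          # subtract the extracted cluster
    return clusters
```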
Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how statistics works and why some statistical models are better than others. Once this notion of a "good model" is introduced, a natural question arises: is it possible that for some piece of data there is no good model? If so, how often do such bad ("non-stochastic") data appear "in real life"? Another, more technical motivation comes from algorithmic information theory. In this theory a notion of complexity of a finite object (= the amount of information in this object) is introduced; it assigns to every object a number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistics provides a more fine-grained classification: for each finite object a curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve. In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information theory (Kolmogorov complexity).
Cohen et al. developed an O(log n)-approximation algorithm for minimizing the total hub label size (the l1 norm). We give O(log n)-approximation algorithms for the problems of minimizing the maximum label size (the l∞ norm) and of minimizing the lp and lq norms simultaneously.
The paper discusses plural forms of Russian nouns (in particular, surnames), like vsjakie tam Ivanovy (‘various Ivanovs’, ‘all sorts of Ivanovs’), that express a negative opinion of the referents. The co-occurrence patterns of such Pl.Pej forms are revealed using web-corpus data. Pl.Pej forms combine foremost with universal quantifiers, including ‘all’, ‘all of these’, etc., and are easily integrated into quantificational expressions, e.g., combinations with numerals, collective nouns, and expressions that include number words like mnogo (‘many’). These elements convey and support the meaning of multiplicity and non-uniqueness of the objects denoted by Pl.Pej forms. Among the usages of Pl.Pej, names of “oligarchs” and of “right-wing, liberal politicians” predominate, and the form appears mainly in heavily politicized texts. The studied form and its co-occurrence patterns are a legacy of Soviet socio-political discourse and originate from the language of Soviet newspapers. The Pl.Pej form is still part of an aggressive leftist discourse directed against a “group of the rich”, whose sender presents himself as a representative of a “group of the poor, oppressed, socially humiliated”.
Unlike most papers devoted to improvements of code-based cryptosystems, where the original Goppa codes are substituted by some other codes, we suggest a new strengthening method which is code-independent. We show that, up to some limit, the security of the new code-based cryptosystem is much closer to the hardness of maximum likelihood decoding than that of the original McEliece cryptosystem.
The paper studies the efficiency of nine state-of-the-art algorithms for scheduling workflow applications in heterogeneous computing systems (HCS). The algorithms are compared by means of discrete-event simulation for a wide range of workflow and system configurations. The developed open-source simulation framework, based on the SimGrid toolkit, allowed us to perform a large number of experiments in a reasonable amount of time and to ensure reproducible results. The accuracy of the underlying network model helped to reveal drawbacks of the simpler models commonly used for studying scheduling algorithms.
The research presented in this paper has been conducted within the framework of a large sociolinguistic project aimed at describing everyday spoken Russian and analyzing the special characteristics of its usage by different social groups of speakers. The research is based on the material of the ORD corpus, which contains long-term audio recordings of everyday communication. The aim of this exploratory study is to reveal the linguistic parameters in terms of which the differences in speech between social groups are most evident. An exploratory subcorpus was created and fully annotated, consisting of audio fragments of spoken communication of 12 respondents (6 men and 6 women, 4 representatives of each age group, and representatives of different professional and status groups) with a total duration of 106 min and similar communication settings. A quantitative description of a number of linguistic parameters on the phonetic, lexical, morphological, and syntactic levels was made for each social group. The biggest differences between social groups were observed in speech rate, phonetic reduction, lexical preferences, and syntactic irregularities. The study has shown that the differences between age groups are more significant than those between gender groups, and that the speech of young people differs most strongly from that of the others.
A digraph G = (V,E) with a distinguished set T ⊆ V of terminals is called inner Eulerian if for each v ∈ V − T the numbers of arcs entering and leaving v are equal. By a T-path we mean a simple directed path connecting distinct terminals with all intermediate nodes in V − T. This paper concerns the problem of finding a maximum T-path packing, i.e. a maximum collection of arc-disjoint T-paths. A min-max relation for this problem was established by Lomonosov. The capacitated version was studied by Ibaraki, Karzanov, and Nagamochi, who came up with a strongly polynomial algorithm of complexity O(φ(V,E) · log T + V²E) (hereinafter φ(n,m) denotes the complexity of a max-flow computation in a network with n nodes and m arcs). For unit capacities, the latter algorithm takes O(φ(V,E) · log T + VE) time, which is unsatisfactory since a max-flow can be found in o(VE) time. For this case, we present an improved method that runs in O(φ(V,E) · log T + E log V) time. Thus, plugging in the max-flow algorithm of Dinic, we reduce the overall complexity from O(VE) to O(min(V^{2/3}E, E^{3/2}) · log T).
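Spelling out the last substitution step: Dinic's algorithm finds a unit-capacity max-flow in O(min(V^{2/3}E, E^{3/2})) time, so

```latex
\varphi(V,E) = O\!\left(\min\!\left(V^{2/3}E,\; E^{3/2}\right)\right)
\;\Longrightarrow\;
O\!\left(\varphi(V,E)\cdot\log T + E\log V\right)
= O\!\left(\min\!\left(V^{2/3}E,\; E^{3/2}\right)\cdot\log T\right),
```

where the E log V term is absorbed because it is asymptotically dominated by the max-flow term.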
The insufficient performance of statistical recognition of composite objects (images, speech signals) is explored in the case of a medium-sized database (thousands of classes). In contrast to heuristic approximate nearest-neighbor methods, we propose a statistically optimal greedy algorithm. The decision is made on the basis of the Kullback-Leibler minimum information discrimination principle. The model object to be checked at the next step is selected from the class with the maximal likelihood (joint density) of distances to the previously checked models. Experimental results for the face recognition task on the FERET dataset are presented. It is shown that the proposed method is much more effective than brute force and fast approximate nearest-neighbor algorithms, such as the randomized kd-tree, perm-sort, and the directed enumeration method.
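The greedy selection rule can be sketched as follows (a simplified stand-in, not the authors' exact criterion): at each step, the next model to check is the one whose precomputed inter-model distances best agree with the query's distances to the models already checked, which corresponds to maximal likelihood under a naive Gaussian assumption. All names below are ours.

```python
import numpy as np

def greedy_recognition(query_dist, model_dists, n_checks=20):
    """Greedy model search (simplified sketch of the idea).

    query_dist(i)      -> KL distance between the query and model i
    model_dists[i][j]  -> precomputed KL distance between models i and j
    """
    n = len(model_dists)
    checked, dists = [], {}
    current = 0                       # start from an arbitrary model
    for _ in range(n_checks):
        dists[current] = query_dist(current)
        checked.append(current)
        # Score every unchecked model: how consistent are its precomputed
        # distances to the checked models with the observed query distances?
        # (A crude Gaussian-likelihood stand-in for the paper's criterion.)
        best, best_score = None, np.inf
        for i in range(n):
            if i in dists:
                continue
            score = sum((model_dists[i][j] - dists[j]) ** 2 for j in checked)
            if score < best_score:
                best, best_score = i, score
        if best is None:
            break
        current = best
    return min(dists, key=dists.get)  # checked model closest to the query
```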
Concept lattices built on noisy data tend to be large and hence hard to interpret. We introduce several measures that can be used to select relevant concepts and discuss how they can be combined. We study their performance in a series of experiments.
A disjunctive model of box bicluster and tricluster analysis is considered. A least-squares, locally optimal one-cluster method is proposed, oriented towards the analysis of binary data. The method involves a parameter, the scale shift, and is proven to lead to “contrast” box bi- and tri-clusters. An experimental study of the method is reported.
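One possible reading of such a one-cluster search, offered only as a hedged sketch: maximize the within-box sum of shifted entries a_ij − λ over row and column memberships by local toggling. The initialization, update order, and stopping rule below are our guesses, not the paper's method.

```python
import numpy as np

def one_box_bicluster(A, shift=0.5, n_iter=100, seed=0):
    """Locally optimal one-cluster box search on a binary matrix A
    (rows x cols): greedily toggle row/column membership while the
    within-box sum of (a_ij - shift) grows.  A simplified sketch."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    rows = rng.random(n) < 0.5         # initial row membership
    cols = rng.random(m) < 0.5         # initial column membership
    B = A - shift                      # shifted data: entries > 0 attract
    for _ in range(n_iter):
        changed = False
        for i in range(n):             # toggle a row if it increases the sum
            delta = B[i, cols].sum() * (1 if not rows[i] else -1)
            if delta > 0:
                rows[i] = not rows[i]; changed = True
        for j in range(m):             # likewise for columns
            delta = B[rows, j].sum() * (1 if not cols[j] else -1)
            if delta > 0:
                cols[j] = not cols[j]; changed = True
        if not changed:                # local optimum reached
            break
    return rows, cols
```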
Theoretical analysis in [1] suggested that adversarially trained generative models are naturally inclined to learn distributions with low support. In particular, this effect is caused by the limited capacity of the discriminator network. To verify this claim, [2] proposed a statistical test based on the birthday paradox that partially confirmed the analysis. In this paper, we continue this line of work and develop a straightforward, parameter-free method to estimate the support size of an arbitrary decoder-based generative model. Our approach considers the decoder network from a geometric viewpoint and evaluates the support size as the volume of the manifold containing the generative model’s samples. Additionally, we propose a method to measure the non-uniformity of a generative model, which can provide additional insight into the model’s behavior. We then apply these tools to perform a quantitative comparison of common generative models.
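To make the geometric viewpoint concrete, here is a minimal JAX sketch (our illustration, not the paper's exact estimator): the decoder's image is treated as a manifold, and its Riemannian volume element sqrt(det(JᵀJ)) is averaged by Monte Carlo over the latent prior. The decoder, dimensions, and sample counts are placeholders.

```python
import jax
import jax.numpy as jnp

def manifold_log_volume_element(decoder, z):
    """log sqrt(det(J^T J)) for the decoder Jacobian at latent point z."""
    J = jax.jacfwd(decoder)(z)               # shape (data_dim, latent_dim)
    sign, logdet = jnp.linalg.slogdet(J.T @ J)
    return 0.5 * logdet

def estimate_log_volume(decoder, latent_dim, n_samples=1024, seed=0):
    """Monte Carlo estimate of the log expected volume element of the
    decoder manifold under a standard-normal latent prior (a sketch of
    the geometric approach, not the paper's exact estimator)."""
    key = jax.random.PRNGKey(seed)
    zs = jax.random.normal(key, (n_samples, latent_dim))
    log_elems = jax.vmap(lambda z: manifold_log_volume_element(decoder, z))(zs)
    # Average the volume element in log-space for numerical stability.
    return jax.scipy.special.logsumexp(log_elems) - jnp.log(n_samples)
```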
This study considers the problem of automated detection of non-relevant posts on Web forums and approaches it by approximating it as the task of detecting semantic relatedness between a given post and the opening post of the forum discussion thread. The approximated task can be resolved by training a supervised classifier on the composed word embeddings of the two posts. Since success in this task can be quite sensitive to the choice of word representations, we compare the performance of different word embedding models. We train 7 models (Word2Vec, GloVe, Word2Vec-f, Wang2Vec, AdaGram, FastText, Swivel), evaluate the embeddings they produce on a dataset of human judgments, and compare their performance on the task of non-relevant post detection. For the comparison, we compile a dataset of semantic relatedness with posts from one of the most popular Russian Web forums, the imageboard "2ch", which has challenging lexical and grammatical features.
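As an illustration of this classification setup, here is a hedged Python sketch: each post is mean-pooled into a vector from pretrained embeddings, the two vectors are composed by concatenation, and a linear classifier is trained on the result. The model file name, the concatenation choice, and the classifier are our assumptions, not details fixed by the abstract.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

# Hypothetical pretrained vectors (any of the compared models would do).
wv = KeyedVectors.load_word2vec_format("w2v_2ch.bin", binary=True)

def post_vector(tokens):
    """Mean-pool the embeddings of the in-vocabulary tokens of a post."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def pair_features(opening_post, post):
    """Compose the two post vectors; concatenation is one simple choice."""
    return np.concatenate([post_vector(opening_post), post_vector(post)])

# X: composed features for (opening post, post) pairs; y: 1 = non-relevant.
# X = np.stack([pair_features(op, p) for op, p in pairs]); y = labels
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```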