A hybrid of two novel methods, additive fuzzy spectral clustering and lifting over a taxonomy, is applied to analyse the research activities of a department. Specifically, we concentrate on the Computer Sciences area represented by the ACM Computing Classification System (ACM-CCS), but the approach is also applicable to other taxonomies. Clusters of the taxonomy subjects are extracted using an original additive spectral clustering method involving a number of model-based stopping conditions. The clusters are then parsimoniously lifted to higher ranks of the taxonomy by minimizing the count of “head subjects” along with their “gaps” and “offshoots”. An example is given illustrating the method applied to real-world data.
Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how statistics works and why some statistical models are better than others. After this notion of a "good model" is introduced, a natural question arises: is it possible that for some piece of data there is no good model? If yes, how often do these bad ("non-stochastic") data appear "in real life"? Another, more technical motivation comes from algorithmic information theory. In this theory a notion of complexity of a finite object (the amount of information in this object) is introduced; it assigns to every object some number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistics provides a more fine-grained classification: for each finite object some curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve. In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information (Kolmogorov complexity) theory.
Cohen et al. developed an O(log n)-approximation algorithm for minimizing the total hub label size (l1 norm). We give O(log n)-approximation algorithms for the problems of minimizing the maximum label (l∞ norm) and minimizing lp and lq norms simultaneously.
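As a rough illustration of the objects being optimized, the sketch below stores a hub labeling as a dictionary and evaluates the label-size norms mentioned above. It is not the approximation algorithm of the abstract; the names `hub_distance` and `label_norm`, the data layout, and the three-node path graph are assumptions made for this example only.

```python
def hub_distance(labels, u, v):
    """Distance query: minimum of d(u, h) + d(h, v) over hubs h shared by u and v."""
    common = labels[u].keys() & labels[v].keys()
    return min((labels[u][h] + labels[v][h] for h in common), default=float("inf"))


def label_norm(labels, p):
    """l_p norm of the vector of label sizes (p = float('inf') gives the maximum)."""
    sizes = [len(hubs) for hubs in labels.values()]
    if p == float("inf"):
        return max(sizes)
    return sum(s ** p for s in sizes) ** (1.0 / p)


# Hypothetical labeling for the unit-length path a - b - c, using b as the shared hub.
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"c": 0, "b": 1},
}

print(hub_distance(labels, "a", "c"))    # distance recovered through hub b
print(label_norm(labels, 1))             # total label size
print(label_norm(labels, float("inf")))  # maximum label size
```

The l1 norm here is the total label size and the l∞ norm is the largest single label, which are the two quantities the abstract contrasts.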
The paper studies the efficiency of nine state-of-the-art algorithms for scheduling workflow applications in heterogeneous computing systems (HCS). The algorithms are compared on the basis of discrete-event simulation for a wide range of workflow and system configurations. The open-source simulation framework we developed, based on the SimGrid toolkit, allowed us to perform a large number of experiments in a reasonable amount of time and to ensure reproducible results. The accuracy of the network model used helped to reveal drawbacks of simpler models commonly used for studying scheduling algorithms.
The research presented in this paper has been conducted in the framework of a large sociolinguistic project aimed at describing everyday spoken Russian and analyzing the special characteristics of its usage by different social groups of speakers. The research is based on the material of the ORD corpus containing long-term audio recordings of everyday communication. The aim of this exploratory study is to reveal the linguistic parameters in which the difference in speech between social groups is most evident. An exploratory subcorpus, consisting of audio fragments of spoken communication of 12 respondents (6 men and 6 women, 4 representatives for each age group, and representatives of different professional and status groups) with a total duration of 106 min and similar communication settings, was created and fully annotated. A quantitative description of a number of linguistic parameters on the phonetic, lexical, morphological, and syntactic levels was made for each social group. The biggest differences between social groups were observed in speech rate, phonetic reduction, lexical preferences, and syntactic irregularities. The study has shown that the differences between age groups are more significant than those between gender groups, and that the speech of young people differs most strongly from the others.
A digraph G = (V,E) with a distinguished set T ⊆ V of terminals is called inner Eulerian if for each v ∈ V − T the numbers of arcs entering and leaving v are equal. By a T-path we mean a simple directed path connecting distinct terminals with all intermediate nodes in V − T. This paper concerns the problem of finding a maximum T-path packing, i.e. a maximum collection of arc-disjoint T-paths. A min-max relation for this problem was established by Lomonosov. The capacitated version was studied by Ibaraki, Karzanov, and Nagamochi, who came up with a strongly-polynomial algorithm of complexity O(φ(V,E) · log T + V^2 E) (hereinafter φ(n,m) denotes the complexity of a max-flow computation in a network with n nodes and m arcs). For unit capacities, the latter algorithm takes O(φ(V,E) · log T + VE) time, which is unsatisfactory since a max-flow can be found in o(VE) time. For this case, we present an improved method that runs in O(φ(V,E) · log T + E log V) time. Thus, plugging in the max-flow algorithm of Dinic, we reduce the overall complexity from O(VE) to O(min(V^{2/3} E, E^{3/2}) · log T).
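The inner Eulerian condition defined above can be checked with a single pass over the arcs. The following is a minimal sketch of that definition only, not of the packing algorithm; the function name `is_inner_eulerian` and the arc-list representation are assumptions for illustration.

```python
from collections import defaultdict


def is_inner_eulerian(arcs, terminals):
    """Return True if every node outside `terminals` has equal in- and out-degree.

    arcs: iterable of (tail, head) pairs; parallel arcs are listed repeatedly.
    """
    balance = defaultdict(int)  # out-degree minus in-degree for each node
    for tail, head in arcs:
        balance[tail] += 1
        balance[head] -= 1
    return all(b == 0 for node, b in balance.items() if node not in terminals)


# Hypothetical example: a directed 4-cycle through terminals s and t.
cycle = [("s", "a"), ("a", "t"), ("t", "b"), ("b", "s")]
print(is_inner_eulerian(cycle, {"s", "t"}))

# Node a receives two arcs but emits only one, so the condition fails.
unbalanced = [("s", "a"), ("s", "a"), ("a", "t")]
print(is_inner_eulerian(unbalanced, {"s", "t"}))
```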
The insufficient performance of statistical recognition of composite objects (images, speech signals) is explored in the case of a medium-sized database (thousands of classes). In contrast to heuristic approximate nearest-neighbor methods, we propose a statistically optimal greedy algorithm. The decision is made based on the Kullback-Leibler minimum information discrimination principle. The model object to be checked at the next step is selected from the class with the maximal likelihood (joint density) of distances to previously checked models. Results of an experimental study on a face recognition task with the FERET dataset are presented. It is shown that the proposed method is much more effective than brute force and fast approximate nearest-neighbor algorithms such as the randomized kd-tree, perm-sort, and the directed enumeration method.
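For orientation, the brute-force baseline against which such methods are compared can be sketched as an exhaustive search for the model with the smallest Kullback-Leibler divergence from the query. This is not the proposed greedy algorithm; representing objects as normalized histograms, and the names `kl_divergence` and `classify`, are assumptions made for this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions.

    eps guards against division by zero when a model bin is empty.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def classify(query, models):
    """Brute force: compare the query with every model and return the closest class."""
    return min(models, key=lambda name: kl_divergence(query, models[name]))


# Hypothetical three-bin histograms for two classes.
models = {
    "A": [0.7, 0.2, 0.1],
    "B": [0.1, 0.2, 0.7],
}
print(classify([0.6, 0.3, 0.1], models))
```

Every query costs one divergence evaluation per model, which is the expense that an ordered, likelihood-driven choice of the next model to check aims to reduce.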
Concept lattices built on noisy data tend to be large and hence hard to interpret. We introduce several measures that can be used in selecting relevant concepts and discuss how they can be combined. We study their performance in a series of experiments.
A disjunctive model of box bicluster and tricluster analysis is considered. A least-squares locally-optimal one-cluster method is proposed, oriented towards the analysis of binary data. The method involves a parameter, the scale shift, and is proven to lead to “contrast” box bi- and tri-clusters. An experimental study of the method is reported.
We present a new concept of biclique as a tool for preimage attacks, which employs many powerful techniques from differential cryptanalysis of block ciphers and hash functions. The new tool has proved to be widely applicable by inspiring many authors to publish new results on the full versions of AES, KASUMI, IDEA, and Square. In this paper, we show how our concept leads to the first cryptanalysis of the round-reduced Skein hash function, and describe an attack on the SHA-2 hash function with more rounds than before.
Finite state transducers over semigroups can be regarded as a formal model of sequential reactive programs. In this paper we introduce a uniform technique for effectively checking functionality, k-valuedness, equivalence and inclusion for this model of computation in the case when the semigroup these transducers operate over is embeddable in a decidable group.
Let G = (V,E) be a digraph with disjoint sets of sources S ⊂ V and sinks T ⊂ V endowed with an S–T flow f : E → Z+. It is a well-known fact that f decomposes into a sum ∑_{s,t} f_{st} of s–t flows f_{st} between all pairs of sources s ∈ S and sinks t ∈ T. In the usual RAM model, such a decomposition can be found in O(E log(V^2/E)) time. The present paper concerns the complexity of this problem in the external memory model (introduced by Aggarwal and Vitter). The internal memory algorithm involves random memory access and thus becomes inefficient. We propose two novel methods. The first one requires O(Sort(E) log(V^2/E)) I/Os and the second one takes O(Sort(E) log U) expected I/Os (where U denotes the maximum value of f).
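To make the decomposition statement concrete, here is a naive in-memory sketch that peels source-to-sink paths off the flow one at a time. It is not the RAM algorithm with the bound quoted above, nor either of the external-memory methods; it assumes the flow is acyclic, that sources have no entering flow and sinks no leaving flow, and the name `decompose_flow` is hypothetical.

```python
from collections import defaultdict


def decompose_flow(flow, sources, sinks):
    """Split an acyclic integral S-T flow into per-pair flows.

    flow: dict mapping arcs (u, v) to non-negative integers.
    Returns a dict mapping (s, t) to a dict arc -> amount routed from s to t.
    """
    residual = {arc: val for arc, val in flow.items() if val > 0}
    out_arcs = defaultdict(list)
    for u, v in residual:
        out_arcs[u].append(v)

    parts = defaultdict(lambda: defaultdict(int))
    for s in sources:
        while True:
            # Walk from s along arcs that still carry flow until a sink is reached.
            path, node = [], s
            while node not in sinks:
                # Discard arcs whose flow has already been used up.
                while out_arcs[node] and residual[(node, out_arcs[node][-1])] == 0:
                    out_arcs[node].pop()
                if not out_arcs[node]:
                    break
                nxt = out_arcs[node][-1]
                path.append((node, nxt))
                node = nxt
            if node not in sinks:
                if path:
                    raise ValueError("flow conservation violated at %r" % (node,))
                break  # nothing left to route from s
            bottleneck = min(residual[arc] for arc in path)
            for arc in path:
                residual[arc] -= bottleneck
                parts[(s, node)][arc] += bottleneck
    return {pair: dict(arcs) for pair, arcs in parts.items()}


# Hypothetical flow with two sources, one inner node and two sinks.
flow = {("s1", "a"): 2, ("s2", "a"): 1, ("a", "t1"): 1, ("a", "t2"): 2}
parts = decompose_flow(flow, ["s1", "s2"], {"t1", "t2"})
for pair, part in sorted(parts.items()):
    print(pair, part)
```

Each path walk jumps between adjacency lists of unrelated nodes, which is the kind of random memory access that the abstract identifies as inefficient in the external memory model.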