A hybrid of two novel methods, additive fuzzy spectral clustering and a lifting method over a taxonomy, is applied to analyse the research activities of a department. Specifically, we concentrate on the Computer Science area represented by the ACM Computing Classification System (ACM-CCS), but the approach is also applicable to other taxonomies. Clusters of the taxonomy subjects are extracted using an original additive spectral clustering method involving a number of model-based stopping conditions. The clusters are then parsimoniously lifted to higher ranks of the taxonomy by minimizing the count of “head subjects” along with their “gaps” and “offshoots”. An example is given illustrating the method applied to real-world data.
Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how statistics works and why some statistical models are better than others. Once this notion of a "good model" is introduced, a natural question arises: is it possible that for some piece of data there is no good model? If so, how often do such bad ("non-stochastic") data appear "in real life"? Another, more technical motivation comes from algorithmic information theory. This theory introduces a notion of complexity of a finite object (the amount of information in the object); it assigns to every object a number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistics provides a more fine-grained classification: for each finite object a curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve. In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information (Kolmogorov complexity) theory.
Cohen et al. developed an O(log n)-approximation algorithm for minimizing the total hub label size (ℓ1 norm). We give O(log n)-approximation algorithms for the problems of minimizing the maximum label (ℓ∞ norm) and minimizing ℓp and ℓq norms simultaneously.
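To make the objectives above concrete, the following minimal sketch shows how a hub labeling answers distance queries and how the norms of the label sizes are computed. The toy graph, the label contents, and the function names `query` and `label_norm` are illustrative assumptions, not taken from the paper, and the sketch does not implement any approximation algorithm.

```python
import math

# Hypothetical hub labels for the path graph a - b - c with unit edge lengths.
# labels[v] maps each hub h in the label of v to the distance d(v, h).
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"c": 0, "b": 1},
}

def query(labels, u, v):
    """Distance via common hubs: min over h in L(u) & L(v) of d(u, h) + d(h, v)."""
    common = labels[u].keys() & labels[v].keys()
    return min((labels[u][h] + labels[v][h] for h in common), default=math.inf)

def label_norm(labels, p):
    """Norm of the vector of label sizes; p = math.inf gives the maximum label size."""
    sizes = [len(hubs) for hubs in labels.values()]
    if p == math.inf:
        return max(sizes)
    return sum(s ** p for s in sizes) ** (1.0 / p)

print(query(labels, "a", "c"))        # distance between the two endpoints
print(label_norm(labels, 1))          # total label size (l1 norm)
print(label_norm(labels, math.inf))   # maximum label size (l-infinity norm)
```

In this toy labeling every pair of vertices shares the hub `b` (or the vertex itself), which is what makes the minimum over common hubs equal to the true distance.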
The paper deals with problems of information security in cyber-physical systems. A performance analysis of autonomous objects has been carried out. An information security monitoring system model is presented, based on characteristics obtained by analyzing the electromagnetic radiation from electronic components in standalone devices of cyber-physical systems. A typical scheme for determining the state of a system is shown. Due to the features of the equipment sustaining the infrastructure, assessment of the information security state is aimed at analyzing normal system operation rather than searching for signatures and characteristics of anomalies during various types of information attacks. An experiment that provides statistical information on the operation of remote devices of cyber-physical systems is described, in which data for decision-making are accumulated by comparing statistical information. Experimental results on information influence on a typical system are presented. The proposed approach for analyzing statistical data of standalone devices, based on a naive Bayesian classifier, can be used to determine information security states. A special feature of the approach is the ability to rapidly adapt and apply various mathematical tools and machine learning methods to achieve a desired quality of probabilistic evaluation. Implementing this type of monitoring does not require the development of complex system applications, and it allows various system architectures, either processing on board an autonomous object or transmitting data and calculating the state on external computing nodes of monitoring and control systems.
Statement of Research. The need to reduce the growing number of system vulnerabilities caused by unauthorized software installed on computing equipment necessitates the development of an approach to automating the audit of data storage media. The article describes an approach to identifying informative assembly instructions. It also shows how the feature chosen to create a unified program signature influences the identification result. Methods. Informativeness is calculated with the Shannon method, which determines feature informativeness for an arbitrary number of object classes and does not depend on the sample size of the observed features. Identification of ELF files was based on the statistical chi-squared test of homogeneity. Main Findings. Quantitative characteristics of informativeness for 118 assembly instructions have been obtained. Experimental results for executable file identification have been analyzed for 10 different features used to create program signatures, which were compared by means of the chi-squared test of homogeneity at significance levels p = 0.05 and p = 0.01. Practical Relevance. The importance of using a particular feature in program signature creation has been established, as well as the possibility of considering several executable file signatures together to provide a summative assessment of whether they belong to a certain program.
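As an illustration of the chi-squared test of homogeneity mentioned above, the sketch below compares two instruction-frequency vectors. The counts, the choice of three instructions, and the function name `chi2_homogeneity` are assumptions made for this example and do not come from the article. With three categories there are two degrees of freedom, for which the chi-squared tail probability has the closed form exp(-x/2), so no statistics library is needed here.

```python
import math

def chi2_homogeneity(counts_a, counts_b):
    """Return (statistic, degrees of freedom) for two count vectors
    defined over the same categories."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat, used = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue  # category absent from both samples contributes nothing
        used += 1
        exp_a = total_a * col / grand  # expected count under homogeneity
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat, used - 1

# Hypothetical counts of three assembly instructions in two executables.
file_a = [50, 30, 20]
file_b = [55, 25, 20]

stat, dof = chi2_homogeneity(file_a, file_b)
p_value = math.exp(-stat / 2)  # exact tail probability only when dof == 2
print(stat, dof, p_value)
print("homogeneity rejected at 0.05:", p_value < 0.05)
```

For other numbers of categories the tail probability would have to come from a chi-squared distribution table or a statistics library.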
The paper studies the efficiency of nine state-of-the-art algorithms for scheduling workflow applications in heterogeneous computing systems (HCS). The algorithms are compared on the basis of discrete-event simulation for a wide range of workflow and system configurations. The developed open-source simulation framework, based on the SimGrid toolkit, allowed us to perform a large number of experiments in a reasonable amount of time and to ensure reproducible results. The accuracy of the network model used helped to reveal drawbacks of simpler models commonly used for studying scheduling algorithms.
The research presented in this paper has been conducted within a large sociolinguistic project aimed at describing everyday spoken Russian and analyzing the special characteristics of its usage by different social groups of speakers. The research is based on the material of the ORD corpus, which contains long-term audio recordings of everyday communication. The aim of this exploratory study is to reveal the linguistic parameters in which the difference in speech between social groups is most evident. An exploratory subcorpus was created and fully annotated; it consists of audio fragments of spoken communication of 12 respondents (6 men and 6 women, 4 representatives for each age group, and representatives of different professional and status groups) with a total duration of 106 min, recorded in similar communication settings. A quantitative description of a number of linguistic parameters on the phonetic, lexical, morphological, and syntactic levels was made for each social group. The biggest differences between social groups were observed in speech rate, phonetic reduction, lexical preferences, and syntactic irregularities. The study has shown that the differences between age groups are more significant than those between gender groups, and that the speech of young people differs most strongly from the others.
A digraph G = (V,E) with a distinguished set T ⊆ V of terminals is called inner Eulerian if for each v ∈ V − T the numbers of arcs entering and leaving v are equal. By a T-path we mean a simple directed path connecting distinct terminals with all intermediate nodes in V − T. This paper concerns the problem of finding a maximum T-path packing, i.e. a maximum collection of arc-disjoint T-paths. A min-max relation for this problem was established by Lomonosov. The capacitated version was studied by Ibaraki, Karzanov, and Nagamochi, who came up with a strongly polynomial algorithm of complexity O(φ(V,E) · log T + V²E) (hereinafter φ(n,m) denotes the complexity of a max-flow computation in a network with n nodes and m arcs). For unit capacities, the latter algorithm takes O(φ(V,E) · log T + VE) time, which is unsatisfactory since a max-flow can be found in o(VE) time. For this case, we present an improved method that runs in O(φ(V,E) · log T + E log V) time. Thus, plugging in the max-flow algorithm of Dinic, we reduce the overall complexity from O(VE) to O(min(V^(2/3) E, E^(3/2)) · log T).
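The inner Eulerian condition defined above can be checked directly from the arc list. The following minimal sketch assumes a digraph given as a list of (tail, head) pairs, with parallel arcs allowed; the function name `is_inner_eulerian` and the example graphs are hypothetical, and the sketch does not touch the packing algorithm itself.

```python
from collections import Counter

def is_inner_eulerian(arcs, terminals):
    """True if every non-terminal node has as many entering as leaving arcs."""
    indeg, outdeg = Counter(), Counter()
    nodes = set()
    for tail, head in arcs:
        outdeg[tail] += 1
        indeg[head] += 1
        nodes.update((tail, head))
    inner = nodes - set(terminals)
    return all(indeg[v] == outdeg[v] for v in inner)

# Hypothetical example: terminals s and t, one inner node a.
balanced = [("s", "a"), ("a", "t"), ("t", "a"), ("a", "s")]
unbalanced = [("s", "a"), ("s", "a"), ("a", "t")]

print(is_inner_eulerian(balanced, {"s", "t"}))
print(is_inner_eulerian(unbalanced, {"s", "t"}))
```

Terminals are deliberately excluded from the check, since the definition places no degree constraint on them.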
The insufficient performance of statistical recognition of composite objects (images, speech signals) is explored in the case of a medium-sized database (thousands of classes). In contrast to heuristic approximate nearest-neighbor methods, we propose a statistically optimal greedy algorithm. The decision is made based on the Kullback-Leibler minimum information discrimination principle. The model object to be checked at the next step is selected from the class with the maximal likelihood (joint density) of distances to previously checked models. Results of an experimental study on a face recognition task with the FERET dataset are presented. It is shown that the proposed method is much more effective than brute force and than fast approximate nearest-neighbor algorithms such as the randomized kd-tree, perm-sort, and the directed enumeration method.
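For orientation, the decision rule named above can be illustrated with the brute-force baseline: assign a query to the model with the smallest Kullback-Leibler divergence. This sketch is not the proposed greedy algorithm; the histograms, the smoothing constant `eps`, and the function names are assumptions chosen for the example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions.
    eps is a small smoothing constant that avoids division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def nearest_model(query, models):
    """Brute force: scan every model and return the name with minimal divergence."""
    return min(models, key=lambda name: kl_divergence(query, models[name]))

# Hypothetical feature histograms for two classes and one query object.
models = {
    "A": [0.7, 0.2, 0.1],
    "B": [0.1, 0.2, 0.7],
}
query = [0.6, 0.3, 0.1]

print(nearest_model(query, models))
```

The brute-force scan evaluates the divergence against every model, which is the cost that ordering the checks by likelihood is meant to reduce.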
Concept lattices built on noisy data tend to be large and hence hard to interpret. We introduce several measures that can be used in selecting relevant concepts and discuss how they can be combined. We study their performance in a series of experiments.
A disjunctive model of box bicluster and tricluster analysis is considered. A least-squares locally-optimal one-cluster method is proposed, oriented towards the analysis of binary data. The method involves a parameter, the scale shift, and is proven to lead to “contrast” box bi- and tri-clusters. An experimental study of the method is reported.
We present a new concept of biclique as a tool for preimage attacks, which employs many powerful techniques from differential cryptanalysis of block ciphers and hash functions. The new tool has proved to be widely applicable, inspiring many authors to publish new results on the full versions of AES, KASUMI, IDEA, and Square. In this paper, we show how our concept leads to the first cryptanalysis of the round-reduced Skein hash function, and describe an attack on the SHA-2 hash function with more rounds than before.