Pair-wise sequence alignment is the basic method of comparative analysis of proteins and nucleic acids. Studying the results of alignment one has to consider two questions^ (1) did the program find all the interesting similarities (“sensitivity”) and (2) are all the found similarities interesting (“selectivity‘’). Definitely, one has to specify, what alignments are considered as the interesting ones. Analogous questions can be addressed to each of the obtained alignments: (3) which part of the aligned positions are aligned correctly (“confidence”) and (4) does alignment contain all pairs of the corresponding positions of compared sequences (“accuracy”). Naturally, the answer on the questions depends on the definition of correct alignment. The presentation addresses above two pairs of questions that are extremely important in interpreting of the results of sequence comparison.
The nearest neighbor search problem is well known since 60s. Many approaches have been proposed. One is to build a graph over the set of objects from a given database and use a greedy walk as a basis for a search algorithm. If the greedy walk has an ability to find the nearest neighbor in the graph starting from any vertex with a small number of steps, such a graph is called a navigable small world. In this paper we propose a new algorithm for building graphs with navigable small world properties. The main advantage of the proposed algorithm is that it is free from input parameters and has an ability to adapt on the fly to any changes in the distribution of data. The algorithm is based on the idea of removing local minimums by adding new edges. We realize this idea to improve search properties of the structure by using the set of queries in the execution stage. An empirical study of the proposed algorithm and comparison with previous works are reported in the paper.
The parallel computing algorithms are explored to improve the efficiency of image recognition with large database. The novel parallel version of the directed enumeration method (DEM) is proposed. The experimental study results in face recognition problem with FERET and Essex datasets are presented. We compare the performance of our parallel DEM with the original DEM and parallel implementations of the nearest neighbor rule and conventional Best Bin First (BBF) k-d tree. It is shown that the proposed method is characterized by increased computing efficiency (2-10 times in comparison with exhaustive search and the BBF) and lower error rate than the original DEM.
We consider the problem of sorting a sequence of n keys in a RAM-like environment where memory faults are possible. An algorithm is said to be δ-resilient if it can tolerate up to δ memory faults during its execution. A resilient sorting algorithm must produce a sequence where every pair of uncorrupted keys is ordered correctly. Finocchi, Grandoni, and Italiano devised a δ-resilient deterministic mergesort algorithm that runs in O(n log n + δ2) time. We present a δ-resilient randomized algorithm (based on quicksort) that runs in O(n log n + δ√n log n) expected time and its deterministic variation that runs in O(n log n +δ√n log n) worst-case time. This improves the previous known result for δ >√n log n. Our deterministric sorting relies on the notion of an approximate kth order statistic. For this auxiliary problem, we devise a deterministic algorithm that runs in O(n + δ√n) time and produces a key (either corrupted or not) whose order rank differs from k by at most O(δ).
In this paper we present a new approach to reduce the computational time spent on coloring in one of the recent branch-and-bound algorithms for the maximum clique problem. In this algorithm candidates to the maximum clique are colored in every search tree node. We suggest that the coloring computed in the parent node is reused for the child nodes when it does not lead to many new branches. So we reuse the same coloring only in the nodes for which the upper bound is greater than the current best solution only by a small value δ. The obtained increase in performance reaches 70 % on benchmark instances.
The article refers to the problem of ringing artifact suppression. The ringing effect is caused by high-frequency information corruption or loss, it appears as waves or oscillations near strong edges. We propose a novel method for ringing artifact suppression after Fourier cut-off filtering. It can be also used for image deringing in the case of image resampling and other applications where the frequency loss can be estimated. The method is based on the joint sparse coding approach. The proposed method preserves more small image details than the state-of-the-art algorithms based on total variation minimization, and outperforms them in terms of image quality metrics.
In this paper we introduce a generalized learning algorithm for probabilistic topic models (PTM). Many known and new algorithms for PLSA, LDA, and SWB models can be obtained as its special cases by choosing a subset of the following “options”: regularization, sampling, update frequency, sparsing and robustness. We show that a robust topic model, which distinguishes specific, background and topic terms, doesn’t need Dirichlet regularization and provides controllably sparse solution.
We propose a novel approach for solving the approximate nearest neighbor search problem in arbitrary metric spaces. The distinctive feature of our approach is that we can incrementally build a non-hierarchical distributed structure for given metric space data with a logarithmic complexity scaling on the size of the structure and adjustable accuracy probabilistic nearest neighbor queries. The structure is based on a small world graph with vertices corresponding to the stored elements, edges for links between them and the greedy algorithm as base algorithm for searching. Both search and addition algorithms require only local information from the structure. The performed simulation for data in the Euclidian space shows that the structure built using the proposed algorithm has navigable small world properties with logarithmic search complexity at fixed accuracy and has weak (power law) scalability with the dimensionality of the stored data.
Since the early 1990s, speaker adaptation have become one of the intensive areas in speech recognition. State-of-the-art batch-mode adaptation algorithms assume that speech of particular speaker contains enough information about the user's voice. In this article we propose to allow the user to manually verify if the adaptation is useful. Our procedure requires the speaker to pronounce syllables containing each vowel of particular language. The algorithm contains two steps looping through all syllables. At first, LPC analysis is performed for extracted vowel and the LPC coefficients are used to synthesize the new sound (with a fixed pitch period) and play it. If this synthesized sound is not perceived by the user as an original one then the syllable should be recorded again. At the second stage, speaker is asked to produce another syllable with the same vowel to automatically verify the stability of pronunciation. If two signals are closed (in terms of the Itakura-Saito divergence) then the sounds are marked as "good" for adaptation. Otherwise both steps are repeated. In the experiment we examine a problem of vowel recognition for Russian language in our voice control system which fuses two classifiers: the CMU Sphinx with speaker-independent acoustic model and Euclidean comparison of MFCC features of model vowel and input signal frames. Our results support the statement that the proposed approach provides better accuracy and reliability in comparison with traditional MAP/MLLR techniques implemented in the CMU Sphinx.
The paper presents experimental results on automatic word sense disambiguation (WSD). Contexts for polysemous and/or homonymic Russian nouns denoting physical objects serve as an empirical basis of the study. Sets of contexts were extracted from the Russian National Corpus (RNC). Machine learning software for WSD was developed within the framework of the project. WSD tool used in experiments is aimed at statistical processing and classification of noun contexts. WSD procedure was performed taking into account lexical markers of word meanings in contexts and semantic annotation of contexts. Sets of experi- ments allowed to define optimal conditions for WSD in Russian texts.
Stock selection by Sharp ratio is considered in the framework of multiple statistical hypotheses testing theory. The main attention is paid to comparison of Holm stepdown and Hochberg step up procedures for different loss functions. Comparison is made on the basis of condittional risk as a function of selection threshold. This approach allows to discover that properties of procedures depend not only on relationship between test statistics, but also depend on dispersion of Sharp ratios. Difference in error rate between two procedures is increasing when the concentration of Sharp ratios is increasing. When Sharp ratios do not have a concentration points there is no significant difference in quality of both procedures.
We study the computational complexity of finding a maximum independent set of vertices in a planar graph. In general, this problem is known to be NP-hard. However, under certain restrictions it becomes polynomial-time solvable. We identify a graph parameter to which the complexity of the problem is sensible and produce a number of both negative (intractable) and positive (solvable in polynomial time) results, generalizing several known facts.
The ORD corpus is a representative resource of everyday spoken Russian that contains about 1000 h of long-term audio recordings of daily communication made in real settings by research volunteers. ORD macro episodes are the large communication episodes united by setting/scene of communication, social roles of participants and their general activity. The paper describes annotation principles used for tagging of macro episodes, provides current statistics on communication situations presented in the corpus and reveals their most common types. Annotation of communication situations allows using these codes as filters for selection of audio data, therefore making it possible to study Russian everyday speech in different communication situations, to determine and describe various registers of spoken Russian. As an example, several high frequency word lists referring to different communication situations are compared. Annotation of macro episodes that is made for the ORD corpus is a prerequisite for its further pragmatic annotation.
This paper presents the parallel architecture of the conjugate gradient learning algorithm for the feedforward neural networks. The proposed solution is based on the high parallel structures to speed up learning performance. Detailed parallel neural network structures are explicitly shown.
The main syndrome of severe poisoning is coma. An option of coma outcome is a vegetative state. EEG reactivity due to intravenous benzodiazepines estimates the prognosis for such patients. However, a positive benzodiazepines test has the predictability of about 50-60 %. The aim of the work is to assess the role of interaction between gamma amino butyric acid (GABA) and cholinergic systems of the brain. The consequent injections of benzodiazepine and atropine lead to a 20 % increase in predictability. The results obtained confirm the following hypothesis. Abnormality of GABA-cholinergic interaction is one of the mechanisms of forming a stable pathological system resulting in the pathogenesis of the vegetative state.