In this paper we introduce a generalized learning algorithm for probabilistic topic models (PTM). Many known and new algorithms for the PLSA, LDA, and SWB models can be obtained as its special cases by choosing a subset of the following “options”: regularization, sampling, update frequency, sparsing, and robustness. We show that a robust topic model, which distinguishes specific, background, and topic terms, does not need Dirichlet regularization and provides a controllably sparse solution.
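As a point of reference for the “options” above, a plain (unregularized, dense, non-robust) PLSA EM iteration can be sketched as follows; the array layout and variable names are our own illustrative assumptions, not the paper's notation, and a regularized variant would add regularizer gradients to the M-step counts:

```python
import numpy as np

def plsa_em_step(ndw, phi, theta):
    """One EM iteration of plain PLSA.
    ndw:   (D, W) term counts per document
    phi:   (W, T) word-in-topic distributions p(w|t)
    theta: (T, D) topic-in-document distributions p(t|d)"""
    D, W = ndw.shape
    T = phi.shape[1]
    nwt = np.zeros((W, T))
    ntd = np.zeros((T, D))
    for d in range(D):
        # E-step: posterior p(t | d, w) for every word of document d
        p = phi * theta[:, d]                       # (W, T)
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: accumulate expected counts
        nwt += ndw[d][:, None] * p
        ntd[:, d] = (ndw[d][:, None] * p).sum(axis=0)
    # normalize counts back into distributions
    phi = nwt / nwt.sum(axis=0, keepdims=True)
    theta = ntd / ntd.sum(axis=0, keepdims=True)
    return phi, theta
```

In an ARTM-style generalization, sparsity and robustness enter exactly at the normalization step, by adding (possibly negative) regularizer terms to `nwt` and `ntd` before normalizing.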
We propose a novel approach to the approximate nearest neighbor search problem in arbitrary metric spaces. The distinctive feature of our approach is that it incrementally builds a non-hierarchical distributed structure for given metric space data, with logarithmic complexity scaling in the size of the structure and probabilistic nearest neighbor queries of adjustable accuracy. The structure is based on a small world graph whose vertices correspond to the stored elements and whose edges link them, with the greedy algorithm as the base search algorithm. Both the search and the addition algorithm require only local information from the structure. Simulation for data in Euclidean space shows that the structure built by the proposed algorithm has navigable small world properties, with logarithmic search complexity at fixed accuracy and weak (power-law) scaling with the dimensionality of the stored data.
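A minimal sketch of the two local operations described above: greedy routing to a local minimum, and incremental insertion that links a new element to its approximate nearest neighbors found by greedy searches. The degree parameter `k` and the number of entry points are our own illustrative choices; the paper's construction is more elaborate:

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def greedy_search(graph, coords, query, start):
    """Greedy routing: repeatedly move to the neighbor closest to the
    query; stop at a local minimum. Uses only local graph information."""
    current = start
    while True:
        nbrs = graph[current]
        best = min(nbrs, key=lambda v: dist(coords[v], query), default=current)
        if nbrs and dist(coords[best], query) < dist(coords[current], query):
            current = best
        else:
            return current

def insert(graph, coords, point, k=4):
    """Incremental construction: link the new element to the k closest
    of the elements reached by greedy searches from random entry points."""
    new_id = len(coords)
    coords.append(point)
    graph[new_id] = set()
    if new_id == 0:
        return new_id
    found = {greedy_search(graph, coords, point, random.randrange(new_id))
             for _ in range(2 * k)}
    for v in sorted(found, key=lambda v: dist(coords[v], point))[:k]:
        graph[new_id].add(v)
        graph[v].add(new_id)
    return new_id
```

Since both operations touch only a vertex and its neighborhood, the structure can be distributed across machines that each hold a subset of vertices with their adjacency lists.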
Since the early 1990s, speaker adaptation has become one of the most intensively studied areas in speech recognition. State-of-the-art batch-mode adaptation algorithms assume that the speech of a particular speaker contains enough information about the user's voice. In this article we propose to let the user manually verify whether the adaptation is useful. Our procedure requires the speaker to pronounce syllables containing each vowel of a particular language. The algorithm consists of two steps looping through all syllables. First, LPC analysis is performed on the extracted vowel, and the LPC coefficients are used to synthesize a new sound (with a fixed pitch period) and play it back. If the synthesized sound is not perceived by the user as the original one, the syllable should be recorded again. At the second stage, the speaker is asked to produce another syllable with the same vowel so that the stability of pronunciation can be verified automatically. If the two signals are close (in terms of the Itakura-Saito divergence), the sounds are marked as "good" for adaptation; otherwise both steps are repeated. In the experiment we examine the problem of vowel recognition for Russian in our voice control system, which fuses two classifiers: CMU Sphinx with a speaker-independent acoustic model, and Euclidean comparison of MFCC features of the model vowel and input signal frames. Our results support the claim that the proposed approach provides better accuracy and reliability in comparison with the traditional MAP/MLLR techniques implemented in CMU Sphinx.
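The stability check of the second stage can be sketched as follows; here the divergence is computed between power spectra, and the acceptance threshold is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def itakura_saito(p, q, eps=1e-10):
    """Itakura-Saito divergence D_IS(p || q) between two power spectra.
    Zero iff the spectra coincide; asymmetric, like the original measure."""
    r = (p + eps) / (q + eps)
    return float(np.mean(r - np.log(r) - 1.0))

def stable_pronunciation(spec_a, spec_b, threshold=0.1):
    """Mark two recordings of the same vowel as "good" for adaptation
    when their spectra are close; the threshold value is hypothetical."""
    return itakura_saito(spec_a, spec_b) < threshold
```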
The paper presents experimental results on automatic word sense disambiguation (WSD). Contexts of polysemous and/or homonymous Russian nouns denoting physical objects serve as the empirical basis of the study. Sets of contexts were extracted from the Russian National Corpus (RNC). Machine learning software for WSD was developed within the framework of the project. The WSD tool used in the experiments is aimed at statistical processing and classification of noun contexts. The WSD procedure takes into account lexical markers of word meanings in contexts as well as the semantic annotation of contexts. The sets of experiments allowed us to determine optimal conditions for WSD in Russian texts.
Stock selection by the Sharpe ratio is considered within the framework of multiple statistical hypothesis testing. The main attention is paid to the comparison of the Holm step-down and Hochberg step-up procedures for different loss functions. The comparison is made on the basis of conditional risk as a function of the selection threshold. This approach allows us to discover that the properties of the procedures depend not only on the relationship between the test statistics but also on the dispersion of the Sharpe ratios. The difference in error rate between the two procedures increases as the concentration of Sharpe ratios increases; when the Sharpe ratios have no concentration points, there is no significant difference in the quality of the two procedures.
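The two procedures being compared use the same thresholds but traverse the ordered p-values in opposite directions, which is why Hochberg's step-up rule rejects at least as much as Holm's step-down rule. A minimal sketch (standard textbook formulations, not the paper's loss-function-specific variants):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: scan ordered p-values from smallest to largest,
    reject while p_(i) <= alpha / (m - i + 1); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: find the largest ordered index passing the same
    threshold and reject every hypothesis with a smaller ordered p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    cut = -1
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            cut = rank
    for rank in range(cut + 1):
        reject[order[rank]] = True
    return reject
```

In stock selection, each hypothesis would correspond to one stock's Sharpe ratio exceeding the selection threshold, so the rejection sets directly give the two selected portfolios being compared.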
We study the computational complexity of finding a maximum independent set of vertices in a planar graph. In general, this problem is known to be NP-hard; however, under certain restrictions it becomes polynomial-time solvable. We identify a graph parameter to which the complexity of the problem is sensitive and produce a number of both negative (intractable) and positive (polynomial-time solvable) results, generalizing several known facts.
The ORD corpus is a representative resource of everyday spoken Russian that contains about 1000 hours of long-term audio recordings of daily communication made in real settings by research volunteers. ORD macro episodes are large communication episodes united by the setting/scene of communication, the social roles of the participants, and their general activity. The paper describes the annotation principles used for tagging macro episodes, provides current statistics on the communication situations represented in the corpus, and reveals their most common types. Annotation of communication situations allows using these codes as filters for selecting audio data, making it possible to study Russian everyday speech in different communication situations and to determine and describe various registers of spoken Russian. As an example, several high-frequency word lists referring to different communication situations are compared. The annotation of macro episodes made for the ORD corpus is a prerequisite for its further pragmatic annotation.
This paper presents a parallel architecture of the conjugate gradient learning algorithm for feedforward neural networks. The proposed solution is based on highly parallel structures that speed up learning performance. Detailed parallel neural network structures are shown explicitly.
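The core arithmetic of conjugate gradient learning is a per-weight vector update, which is what makes a parallel hardware realization natural. A sketch of one Fletcher-Reeves direction update (a standard formulation; the paper's architecture maps such elementwise operations onto parallel structures):

```python
import numpy as np

def fletcher_reeves_direction(grad, prev_grad, prev_dir):
    """Fletcher-Reeves conjugate gradient direction update:
    d_k = -g_k + beta_k * d_{k-1},  beta_k = ||g_k||^2 / ||g_{k-1}||^2.
    Every operation is an elementwise or reduction op over the weight
    vector, so it parallelizes naturally across the network weights."""
    beta = (grad @ grad) / (prev_grad @ prev_grad)
    return -grad + beta * prev_dir
```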
The main syndrome of severe poisoning is coma; one possible outcome of coma is a vegetative state. EEG reactivity in response to intravenous benzodiazepines is used to estimate the prognosis for such patients. However, a positive benzodiazepine test has a predictive value of only about 50-60%. The aim of this work is to assess the role of the interaction between the gamma-aminobutyric acid (GABA) and cholinergic systems of the brain. Consecutive injections of a benzodiazepine and atropine lead to a 20% increase in predictive value. The results obtained confirm the following hypothesis: abnormality of the GABA-cholinergic interaction is one of the mechanisms forming a stable pathological system underlying the pathogenesis of the vegetative state.
Tropical algebra emerges in many fields of mathematics such as algebraic geometry, mathematical physics and combinatorial optimization. In part, its importance is related to the fact that it makes various parameters of mathematical objects computationally accessible. Tropical polynomials play an important role in this, especially in the case of algebraic geometry. On the other hand, many algebraic questions behind tropical polynomials remain open. In this paper we address three basic questions on tropical polynomials closely related to their computational properties:
1. Given a polynomial with a certain support (set of monomials) and a (finite) set of inputs, when is it possible for the polynomial to vanish on all these inputs?
2. More precisely, given a polynomial with a certain support and a (finite) set of inputs, how many roots can the polynomial have on this set of inputs?
3. Given an integer k, for which s does there exist a set of s inputs such that any non-zero polynomial with at most k monomials has a non-root among these inputs?
In classical algebra, well-known results in the direction of these questions are the Combinatorial Nullstellensatz, the Schwartz–Zippel lemma, and universal testing sets for sparse polynomials, respectively. In this paper we extensively study these three questions for tropical polynomials and provide results analogous to the classical results mentioned above.
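In the min-plus convention, a tropical polynomial is a minimum of affine forms, and "vanishing at a point" means the minimum is attained by at least two monomials. A minimal sketch of this root test, using our own encoding of polynomials as coefficient/exponent pairs:

```python
def tropical_eval(poly, x):
    """Evaluate a min-plus tropical polynomial at point x.
    poly is a list of (coeff, exponents) pairs; each monomial contributes
    coeff + <exponents, x>, and the polynomial value is their minimum."""
    return min(c + sum(e * xi for e, xi in zip(exps, x))
               for c, exps in poly)

def is_tropical_root(poly, x, tol=1e-9):
    """x is a tropical root iff the minimum is attained by at least two
    monomials -- the tropical analogue of the polynomial vanishing at x."""
    vals = [c + sum(e * xi for e, xi in zip(exps, x)) for c, exps in poly]
    m = min(vals)
    return sum(1 for v in vals if v - m < tol) >= 2
```

For instance, the univariate polynomial "x ⊕ 0" (minimum of x and the constant 0) has its unique tropical root at x = 0, where both monomials attain the minimum.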
Mining ternary relations, or triadic Boolean tensors, is one of the recent trends in knowledge discovery that allows one to take into account various modalities of input object-attribute data. For example, in movie databases like IMDb, an analyst may find not only movies grouped by specific genres but also see their common keywords. In so-called folksonomies, users can be grouped according to their shared resources and the tags they use. In gene expression analysis, genes can be grouped along with tissue samples and time intervals, providing comprehensible patterns. However, the pattern explosion effect is seriously aggravated by even one additional dimension. In this paper, we continue our previous study on searching for a smaller collection of “optimal” patterns in triadic data with respect to a set of quality criteria such as pattern cardinality, density, diversity, coverage, etc. We show how a simple data preprocessing step enabled us to use the frequent itemset mining algorithm Krimp, based on the MDL principle, for triclustering purposes.
A regular realizability (RR) problem is to test the nonemptiness of the intersection of a fixed language (the filter) with a given regular language. We show that RR problems are universal in the following sense: for any language L there exists an RR problem equivalent to L under disjunctive reductions in nondeterministic log space.
From this result we deduce the existence of RR problems complete under polynomial reductions for many complexity classes, including all classes of the polynomial hierarchy.
Following the discussion of the role of the Internet in the formation of ties across space, this paper seeks to supplement recent findings on the prevalence of location-dependent preferential attachment online. We look at networks of online communities specifically aimed at developing location-independent ties. The paper focuses on the 25 largest communities of software developers on the leading Russian social networking site VKontakte, one of which is studied in depth. Evidence suggests that membership and friendship ties are overwhelmingly cross-city and even cross-country, while an in-depth analysis gives grounds to assume that commenting and liking in such communities might also be location-independent. This group case study provides some insights into the nature of professional networking and shows the independence of the three networks: the friendship network as a means of group identification, the commenting network as an advice-giving tool, and the liking network as a result of approval by occasional visitors.