Computing Discriminating and Generic Words
We study the following three problems of computing generic or discriminating words for a given collection of documents. Given a pattern $P$ and a threshold $d$, we want to report (i) all longest extensions of $P$ which occur in at least $d$ documents, (ii) all shortest extensions of $P$ which occur in less than $d$ documents, and (iii) all shortest extensions of $P$ which occur only in $d$ selected documents. For these problems, we propose efficient algorithms based on suffix trees and using advanced data structure techniques. For problem (i), we propose an optimal solution with constant running time per output word.
This volume contains the papers presented at the 6th International Conference on Similarity Search and Applications (SISAP 2013), held at A Coruna, Spain, during October 2–4, 2013. The International Conference on Similarity Search and Applications (SISAP) is an annual forum for researchers and application developers in the area of similarity data management. It aims at the technological problems shared by many application domains, such as data mining, information retrieval, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that need similarity searching as a necessary supporting service. Traditionally, SISAP conferences have put emphasis on the distance-based searching, but in general the conference concerns both the effectiveness and efficiency aspects of any similarity search approach.
In this paper, we present a modification of dynamic programming algorithms (DPA), which we denote as graphical algorithms (GrA). For some single machine scheduling problems, it is shown that the time complexity of the GrA is less than the time complexity of the standard DPA. Moreover, the average running time of the GrA is often essentially smaller. A GrA can also solve large-scale instances and instances, where the parameters are not integer. For some problems, GrA has a polynomial time complexity in contrast to a pseudo-polynomial complexity of a DPA.
We study a new variant of the pattern matching problem called cross-document pattern matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted ancestor problem.
Information systems have been developed in parallel with computer science, although information systems have roots in different disciplines including mathematics, engineering, and cybernetics. Research in information systems is by nature very interdisciplinary. As it is evidenced by the chapters in this book, dynamics of information systems has several diverse applications. The book presents the state-of-the-art work on theory and practice relevant to the dynamics of information systems. First, the book covers algorithmic approaches to numerical computations with infinite and infinitesimal numbers. Also the book presents important problems arising in service-oriented systems, such as dynamic composition, analysis of modern service-oriented information systems, and estimation of customer service times on a rail network from GPS data. After that, the book addresses the complexity of the problems arising in stochastic and distributed systems. In addition, the book discusses modulating communication for improving multi-agent learning convergence. Network issues, in particular minimum risk maximum clique problems, vulnerability of sensor networks, influence diffusion, community detection, and link prediction in social network analysis, as well as a comparative analysis of algorithms for transmission network expansion planning are described in subsequent chapters. We thank all the authors and anonymous referees for their advice and expertise in providing valuable contributions, which improved the quality of this book. Furthermore, we want to thank Springer for helping us to produce this book.
In this paper, we consider algorithms involved in the computation of the Duquenne–Guigues basis of implications. The most widely used algorithm for constructing the basis is Ganter’s Next Closure, designed for generating closed sets of an arbitrary closure system. We show that, for the purpose of generating the basis, the algorithm can be optimized. We compare the performance of the original algorithm and its optimized version in a series of experiments using artificially generated and real-life datasets. An important computationally expensive subroutine of the algorithm generates the closure of an attribute set with respect to a set of implications. We compare the performance of three algorithms for this task on their own, as well as in conjunction with each of the two algorithms for generating the basis. We also discuss other approaches to constructing the Duquenne–Guigues basis.
We revisit the problems of computing the maximal and the minimal non-empty suffixes of a substring of a longer text of length n, introduced by Babenko, Kolesnichenko and Starikovskaya [CPM’13]. For the minimal suffix problem we show that for any 1 ≤ τ ≤ logn there exists a linear-space data structure with(τ)query time and(nlogn/τ)preprocessing time. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of the text in(kτ)time, where k is the number of distinct factors in the decomposition. For the maximal suffix problem we give a linear-space structure with(1)query time and(n)preprocessing time, i.e., we manage to achieve both the optimal query and the optimal construction time simultaneously.
One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.