### Book

## Proceedings of the 6th International Conference on Similarity Search and Applications (SISAP 2013), Lecture Notes in Computer Science

This volume contains the papers presented at the 6th International Conference on Similarity Search and Applications (SISAP 2013), held in A Coruña, Spain, on October 2–4, 2013. The International Conference on Similarity Search and Applications (SISAP) is an annual forum for researchers and application developers in the area of similarity data management. It addresses the technological problems shared by many application domains, such as data mining, information retrieval, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that need similarity searching as a necessary supporting service. Traditionally, SISAP conferences have emphasized distance-based searching, but in general the conference concerns both the effectiveness and the efficiency of any similarity search approach.

An important characteristic of recommender systems for web pages is the abundance of textual information in and about the items being recommended (web pages). To improve recommendations and enhance user experience, we propose to use automatic tag (keyword) extraction for web pages entering the recommender system. We present a novel tag extraction algorithm that employs semi-supervised classification based on a dataset consisting of pre-tagged documents and (for the most part) partially tagged documents whose tags are automatically mined from the content. We also compare several classification algorithms for tag extraction in this context.

One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.

A novel approach to solving the nearest neighbor search problem in metric spaces is considered. The proposed data structure is a graph with navigable small-world properties, and the search algorithm is gradient descent over that graph. The problem of local minima is addressed by running a series of independent searches. Experimental data are presented that confirm the logarithmic complexity of the search algorithm.
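The search procedure described in this abstract can be illustrated with a minimal sketch. It assumes the graph is already built and stored as an adjacency dictionary; the function names `greedy_search` and `multi_search`, the `attempts` parameter, and the toy graph are hypothetical and are not taken from the paper. Construction of the small-world graph itself is not shown.

```python
import random


def greedy_search(graph, dist, query, start):
    """Move to the neighbor closest to the query until no neighbor improves."""
    current, current_d = start, dist(start, query)
    while True:
        best, best_d = current, current_d
        for neighbor in graph[current]:
            d = dist(neighbor, query)
            if d < best_d:
                best, best_d = neighbor, d
        if best == current:
            # No neighbor is closer: a (possibly only local) minimum.
            return current, current_d
        current, current_d = best, best_d


def multi_search(graph, dist, query, attempts, rng=None):
    """Run several independent greedy searches from random entry points."""
    rng = rng or random.Random()
    nodes = list(graph)
    best, best_d = None, float("inf")
    for _ in range(attempts):
        candidate, d = greedy_search(graph, dist, query, rng.choice(nodes))
        if d < best_d:
            best, best_d = candidate, d
    return best, best_d


if __name__ == "__main__":
    # Toy graph with two disconnected parts, so a single search can get stuck.
    toy = {0: [1], 1: [0], 10: [11], 11: [10]}
    distance = lambda a, b: abs(a - b)
    print(greedy_search(toy, distance, 10.4, 0))
    print(multi_search(toy, distance, 10.4, attempts=20))
```

A single greedy walk started at node 0 stops at node 1, which is a local minimum for the query 10.4. Repeating the search from random entry points makes it very likely that at least one walk starts in the part of the graph that contains the true nearest neighbor.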

This book constitutes the refereed proceedings of the 25th Annual Symposium on Combinatorial Pattern Matching, CPM 2014, held in Moscow, Russia, in June 2014. The 28 revised full papers presented together with 5 invited talks were carefully reviewed and selected from 54 submissions. The papers address issues of searching and matching strings and more complicated patterns such as trees; regular expressions; graphs; point sets; and arrays. The goal is to derive combinatorial properties of such structures and to exploit these properties in order to achieve superior performance for the corresponding computational problems. The meeting also deals with problems in computational biology; data compression and data mining; coding; information retrieval; natural language processing; and pattern recognition.

In this paper, we present a modification of dynamic programming algorithms (DPA), which we denote as graphical algorithms (GrA). For some single machine scheduling problems, it is shown that the time complexity of the GrA is less than the time complexity of the standard DPA. Moreover, the average running time of the GrA is often substantially smaller. A GrA can also solve large-scale instances and instances whose parameters are not integers. For some problems, the GrA has polynomial time complexity, in contrast to the pseudo-polynomial complexity of a DPA.

We study a new variant of the pattern matching problem called *cross-document pattern matching*, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the *weighted ancestor* problem.
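To make the query format concrete, here is a naive baseline for the problem statement only. It is not the indexed solution from the paper, and its running time depends on both the pattern and the target document. The function name and argument names are hypothetical.

```python
def cross_document_occurrences(docs, src, start, end, target):
    """Find where docs[src][start:end] occurs in docs[target] by direct scanning."""
    pattern = docs[src][start:end]
    text = docs[target]
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        # Advance by one so that overlapping occurrences are also reported.
        pos = text.find(pattern, pos + 1)
    return hits


if __name__ == "__main__":
    documents = ["abracadabra", "cadabra abra"]
    # The pattern is the substring "abra" of document 0, searched in document 1.
    print(cross_document_occurrences(documents, 0, 0, 4, 1))
```

The point of the indexed solutions is to answer the same query without rescanning the target document, with a query time that is independent of, or only doubly logarithmic in, the pattern length.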

Information systems have been developed in parallel with computer science, although information systems have roots in different disciplines including mathematics, engineering, and cybernetics. Research in information systems is by nature very interdisciplinary. As evidenced by the chapters in this book, the dynamics of information systems has several diverse applications. The book presents state-of-the-art work on theory and practice relevant to the dynamics of information systems. First, the book covers algorithmic approaches to numerical computations with infinite and infinitesimal numbers. The book also presents important problems arising in service-oriented systems, such as dynamic composition, analysis of modern service-oriented information systems, and estimation of customer service times on a rail network from GPS data. After that, the book addresses the complexity of the problems arising in stochastic and distributed systems. In addition, the book discusses modulating communication for improving multi-agent learning convergence. Network issues, in particular minimum risk maximum clique problems, vulnerability of sensor networks, influence diffusion, community detection, and link prediction in social network analysis, as well as a comparative analysis of algorithms for transmission network expansion planning, are described in subsequent chapters. We thank all the authors and anonymous referees for their advice and expertise in providing valuable contributions, which improved the quality of this book. Furthermore, we want to thank Springer for helping us to produce this book.

We revisit the problems of computing the maximal and the minimal non-empty suffixes of a substring of a longer text of length *n*, introduced by Babenko, Kolesnichenko and Starikovskaya [CPM’13]. For the minimal suffix problem we show that for any 1 ≤ *τ* ≤ log *n* there exists a linear-space data structure with O(*τ*) query time and O(*n* log *n* / *τ*) preprocessing time. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of the text in O(*kτ*) time, where *k* is the number of distinct factors in the decomposition. For the maximal suffix problem we give a linear-space structure with O(1) query time and O(*n*) preprocessing time, i.e., we manage to achieve both the optimal query and the optimal construction time simultaneously.
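For readers unfamiliar with the Lyndon decomposition, the sketch below computes it for a whole string with Duval's standard linear-time algorithm. This is a textbook method shown for illustration, not the substring data structure from the paper. The function names are hypothetical.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    n = len(s)
    i = 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i
            else:
                k += 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def minimal_suffix_naive(s):
    """Quadratic baseline: the lexicographically smallest non-empty suffix."""
    return min(s[i:] for i in range(len(s)))


if __name__ == "__main__":
    word = "banana"
    print(lyndon_factorization(word))
    print(minimal_suffix_naive(word))
```

The last factor of the decomposition coincides with the minimal non-empty suffix of the string, which is the link between the two problems mentioned in the abstract.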

We consider certain spaces of functions on the circle, which naturally appear in harmonic analysis, and superposition operators on these spaces. We study the following question: which functions have the property that each of their superpositions with a homeomorphism of the circle belongs to a given space? We also study the multidimensional case.

We consider the spaces of functions on the m-dimensional torus whose Fourier transform is p-summable. We obtain estimates for the norms of the exponential functions deformed by a C¹-smooth phase. The results generalize to the multidimensional case the one-dimensional results obtained earlier by the author in “Quantitative estimates in the Beurling–Helson theorem”, Sbornik: Mathematics, 201:12 (2010), 1811–1836.

We consider the spaces of functions on the circle whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C¹-smooth phase.
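For orientation, the norm in question can be written out as follows. This is a sketch under assumed notation: the symbols $A_p(\mathbb{T})$, $\hat{f}(k)$, $\varphi$, and $\lambda$ do not appear in the abstract and are introduced here only for illustration.

```latex
% Assumed notation: A_p(T) is the space of functions on the circle T
% whose sequence of Fourier coefficients is p-summable.
\[
  \|f\|_{A_p(\mathbb{T})}
    = \Bigl( \sum_{k \in \mathbb{Z}} |\hat{f}(k)|^{p} \Bigr)^{1/p},
  \qquad
  \hat{f}(k) = \frac{1}{2\pi} \int_{0}^{2\pi} f(t)\, e^{-ikt}\, dt .
\]
% Quantity being estimated (assumed form): the norm of an exponential
% with a C^1-smooth real-valued phase \varphi and a real parameter \lambda.
\[
  \bigl\| e^{i \lambda \varphi} \bigr\|_{A_p(\mathbb{T})}
\]
```

Under this reading, "exponential functions deformed by a C¹-smooth phase" refers to functions of the form $e^{i\lambda\varphi}$.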