### Book chapter

## Cross-Document Pattern Matching

We study a new variant of the string matching problem called {\em cross-document string matching}, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the {\em weighted level ancestor} problem.

### In book

This book constitutes the refereed proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, CPM 2012, held in Helsinki, Finalnd, in July 2012. The 33 revised full papers presented together with 2 invited talks were carefully reviewed and selected from 60 submissions. The papers address issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties of such structures and to exploit these properties in order to either achieve superior performance for the corresponding computational problems or pinpoint conditions under which searches cannot be performed efficiently. The meeting also deals with problems in computational biology, data compression and data mining, coding, information retrieval, natural language processing, and pattern recognition.

We study the following three problems of computing generic or discriminating words for a given collection of documents. Given a pattern $P$ and a threshold $d$, we want to report (i) all longest extensions of $P$ which occur in at least $d$ documents, (ii) all shortest extensions of $P$ which occur in less than $d$ documents, and (iii) all shortest extensions of $P$ which occur only in $d$ selected documents. For these problems, we propose efficient algorithms based on suffix trees and using advanced data structure techniques. For problem (i), we propose an optimal solution with constant running time per output word.

This book constitutes the proceedings of the 21st International Symposium on String Processing and Information Retrieval, SPIRE 2014, held in Ouro Preto, Brazil, in October 2014. The 20 full and 6 short papers included in this volume were carefully reviewed and selected from 45 submissions. The papers focus not only on fundamental algorithms in string processing and information retrieval, but address also application areas such as computational biology, Web mining and recommender systems. They are organized in topical sections on compression, indexing, genome and related topics, sequences and strings, search, as well as on mining and recommending.

We consider a compact text index based on evenly spaced sparse suffix trees of a text \cite{KU-96}. Such a tree is defined by partitioning the text into blocks of equal size and constructing the suffix tree only for those suffixes that start at block boundaries. We propose a new pattern matching algorithm on this structure. The algorithm is based on a notion of suffix links different from that of~\cite{KU-96} and on the packing of several letters into one computer word.

Given a set of $N$ strings $A = \set{\alpha_1, \ldots, \alpha_N}$ of total length $n$ over alphabet~$\Sigma$ one may ask to find, for a fixed integer $K$, $2 \le K \le N$, the longest substring $\beta$ that appears in at least $K$ strings in $A$. It is known that this problem can be solved in $O(n)$ time with the help of suffix trees. However, the resulting algorithm is rather complicated. Also, its running time and memory consumption may depend on~$\abs{\Sigma}$. This paper presents an alternative, remarkably simple approach to the above problem, which relies on the notion of suffix arrays. Once the suffix array of some auxiliary $O(n)$-length string is computed, one needs a simple $O(n)$-time postprocessing to find the requested longest substring. Since a number of efficient and simple linear-time algorithms for constructing suffix arrays has been recently developed (with constant not depending on $|\Sigma|$), our approach seems to be quite practical.

This paper presents two new approaches to solving a classical NP-hard problem of maximum clique problem (MCP), which frequently arises in the domain of information management, including design of database structures and big data processing. In our research, we are focusing on solving that problem using the paradigm of artificial neural networks. The first approach combines the artificial neuro-network paradigm and genetic programming. For boosting the convergence of the Hopfield neural network (HNN), we propose a specific design of the genetic algorithm as the selection mechanism for terms of the HNN energy function. The second approach incorporates and extends the tabu-search heuristics improving performance of network dynamics of so-called tabu machine. Introduction of a special penalty function in tabu machine facilitates better evaluation of the search space. As a result, we demonstrate the proposed approaches on well-known experimental graphs and formulate two hypotheses for further research.

We consider the problems of computing the maximal and the minimal non-empty suffixes of substrings of a longer text of length . n. For the minimal suffix problem we show that for every . τ, . 1≤τ≤logn, there exists a linear-space data structure with . O(τ) query time and . O(nlogn/τ) preprocessing time. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of the text in . O(kτ) time, where . k is the number of distinct factors in the decomposition. For the maximal suffix problem, we give a linear-space structure with . O(1) query time and . O(n) preprocessing time. In other words, we simultaneously achieve both the optimal query time and the optimal construction time. © 2015 Elsevier B.V.