Cross-Document Pattern Matching

Kucherov G., Nekrich Y., Starikovskaya T.

P. 196-207.

We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

Language: English

Keywords: string processing algorithms, data structures, string algorithms

In book

Lecture Notes in Computer Science

Vol. 7354: Proceedings of the 23rd Symposium on Combinatorial Pattern Matching. Berlin : Springer, 2012

On Minimal and Maximal Suffixes of a Substring

Babenko M., Kolesnichenko I., Starikovskaya T., in: Lecture Notes in Computer Science. Vol. 7922: Proceedings of the 24th Symposium on Combinatorial Pattern Matching. Berlin : Springer, 2013. P. 28-37.

Lexicographically minimal and lexicographically maximal suffixes of a string are fundamental notions of stringology. It is well known that the lexicographically minimal and maximal suffixes of a given string S can be computed in linear time and space by constructing a suffix tree or a suffix array of S. Here we consider the case when ...

Added: October 30, 2013

Computing Discriminating and Generic Words

Kucherov G., Nekrich Y., Starikovskaya T., in: Lecture Notes in Computer Science. Vol. 7608: Proceedings of the 19th International Symposium on String Processing and Information Retrieval. Berlin : Springer, 2012. P. 307-317.

We study the following three problems of computing generic or discriminating words for a given collection of documents. Given a pattern $P$ and a threshold $d$, we want to report (i) all longest extensions of $P$ which occur in at least $d$ documents, (ii) all shortest extensions of $P$ which occur in less than $d$ ...

Added: October 30, 2013

Time-Space Trade-Offs for the Longest Common Substring Problem

Vildhoj H. W., Starikovskaya T., in: Lecture Notes in Computer Science. Vol. 7922: Proceedings of the 24th Symposium on Combinatorial Pattern Matching. Berlin : Springer, 2013. P. 223-234.

Added: October 30, 2013

Computing Longest Common Substrings Via Suffix Arrays

Babenko M. A., Starikovskaya T., in: Lecture Notes in Computer Science. Vol. 5010: Proceedings of the Third International Computer Science Symposium in Russia. Berlin : Springer, 2008. P. 64-75.

Given a set of $N$ strings $A = \{\alpha_1, \ldots, \alpha_N\}$ of total length $n$ over alphabet $\Sigma$ one may ask to find, for a fixed integer $K$, $2 \le K \le N$, the longest substring $\beta$ that appears in at least $K$ strings in $A$. It is known that this problem can be solved in ...

Added: October 30, 2013
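
To make the problem statement in the entry "Computing Longest Common Substrings Via Suffix Arrays" concrete, here is a minimal brute-force sketch in Python. It is not the suffix-array algorithm of the paper: it simply enumerates all substrings, so it is only suitable for tiny inputs. The function name `longest_common_substring_k` and the lexicographic tie-breaking rule are assumptions made for this illustration.

```python
from collections import defaultdict


def longest_common_substring_k(strings, k):
    """Return one longest substring that occurs in at least k of the strings.

    Brute force for illustration only: every substring of every string is
    enumerated, so time and memory grow polynomially with a high exponent.
    Ties are broken lexicographically (an arbitrary choice for determinism).
    Returns "" when no non-empty substring is shared by k strings.
    """
    # Map each substring to the set of string indices that contain it.
    owners = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners[s[i:j]].add(idx)

    candidates = [sub for sub, ids in owners.items() if len(ids) >= k]
    # Longest first, then lexicographically smallest.
    return min(candidates, key=lambda sub: (-len(sub), sub), default="")


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana"]
    print(longest_common_substring_k(docs, 3))
    print(longest_common_substring_k(docs, 2))
```

With `K = N` this reduces to the classical longest common substring of all strings; smaller `K` relaxes the requirement to any `K` of them.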

The inverted multi-index

Babenko A., Lempitsky V., in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012). Providence : IEEE, 2012. P. 3069-3076.

A new data structure for efficient similarity search in very large datasets of high-dimensional vectors is introduced. This structure, called the inverted multi-index, generalizes the inverted index idea by replacing the standard quantization within inverted indices with product quantization. For very similar retrieval complexity and preprocessing time, inverted multi-indices achieve a much denser subdivision of ...

Added: October 1, 2014

Technologies for Developing Object-Oriented Programs in C++. Part 1. Fundamentals of Structured Programming in the Algorithmic Language C++

Polyakova O. A., Perm : Perm National Research Polytechnic University Publishing House, 2019

The article deals with the application of the basic principles of structured programming to complex software systems written in the high-level language C++; the principles are demonstrated on meaningful examples. ...

Added: August 31, 2020

Computing the Longest Common Substring with One Mismatch

Babenko M. A., Starikovskaya T., Problemy Peredachi Informatsii (Problems of Information Transmission) 2011 Vol. 47 No. 1 P. 28-33

An algorithm is described that solves the problem of finding an approximate longest common substring of two strings $\alpha_1$ and $\alpha_2$ in $O(|\alpha_1| |\alpha_2|)$ time using $O(|\alpha_1|)$ additional memory. When accessing the string $\alpha_2$, the algorithm reads it only from left to right, starting with the first character. The RAM model of computation is used. ...

Added: October 30, 2013
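
The resource bounds quoted in the entry on the longest common substring with one mismatch (time proportional to the product of the two lengths, extra memory proportional to the first string, the second string read strictly left to right) can be illustrated with a textbook rolling dynamic program. The sketch below is an assumption-laden illustration, not the algorithm from the paper: it interprets "one error" as at most one substituted character, and the name `lcs_one_mismatch` is hypothetical.

```python
def lcs_one_mismatch(a1: str, a2: str) -> int:
    """Length of the longest pair of equal-length substrings of a1 and a2
    that differ in at most one position (assumed meaning of "one mismatch").

    Runs in O(len(a1) * len(a2)) time with O(len(a1)) extra memory and
    consumes a2 one character at a time, left to right.
    """
    m = len(a1)
    # exact[i]: longest common suffix of a1[:i] and the part of a2 read so far,
    #           with no mismatches.
    # one[i]:   the same, but allowing at most one mismatch.
    exact = [0] * (m + 1)
    one = [0] * (m + 1)
    best = 0
    for ch in a2:
        # Go right to left so that index i - 1 still holds the values
        # computed for the previous character of a2.
        for i in range(m, 0, -1):
            if a1[i - 1] == ch:
                exact[i] = exact[i - 1] + 1
                one[i] = one[i - 1] + 1
            else:
                # The single allowed mismatch is spent on this position.
                one[i] = exact[i - 1] + 1
                exact[i] = 0
            if one[i] > best:
                best = one[i]
    return best


if __name__ == "__main__":
    print(lcs_one_mismatch("abcde", "abxde"))
```

Only two arrays indexed by positions of the first string are kept, which is what gives the linear extra memory; the second string is never revisited.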

Lecture Notes in Computer Science

Berlin : Springer, 2012

This book constitutes the refereed proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, CPM 2012, held in Helsinki, Finland, in July 2012. The 33 revised full papers presented together with 2 invited talks were carefully reviewed and selected from 60 submissions. The papers address issues of searching and matching strings and more complicated patterns ...

Added: October 30, 2013

Performance Analysis of Thread Synchronization Strategies in Data Structures Based on Flat-Combining

Galimullin M. F., Kalishenko E., Rapotkin N. A., Izvestiya SPbGETU LETI (Proceedings of Saint Petersburg Electrotechnical University LETI) 2016 No. 7 P. 13-23

The paper deals with the development of thread synchronization strategies based on concurrent "flat-combining" data structures and with the study of their performance. It considers the "flat-combining" approach and its implementation in the libcds library, the development of a thread synchronization strategy, and its possible implementations. The efficiency of the synchronization strategies is studied on ...

Added: November 1, 2018

Hybrid neural network and bi-criteria tabu-machine: comparison of new approaches to maximum clique problem

Babkina T. S., Demidovskij A., Babkin E., International Journal of Big Data Intelligence 2018 Vol. 5 No. 3 P. 143-155

This paper presents two new approaches to solving the classical NP-hard maximum clique problem (MCP), which frequently arises in the domain of information management, including design of database structures and big data processing. In our research, we focus on solving that problem using the paradigm of artificial neural networks. The first approach ...

Added: October 3, 2018

Pattern Matching on Sparse Suffix Trees

Kolpakov R. M., Kucherov G., Starikovskaya T., in: Proceedings of the First International Conference on Data Compression, Communications and Processing. NY : IEEE Computer Society, 2013. P. 92-97.

We consider a compact text index based on evenly spaced sparse suffix trees of a text [KU-96]. Such a tree is defined by partitioning the text into blocks of equal size and constructing the suffix tree only for those suffixes that start at block boundaries. We propose a new pattern matching algorithm on this structure. ...

Added: October 30, 2013
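
The sampling idea behind the evenly spaced sparse index in "Pattern Matching on Sparse Suffix Trees" can be sketched with a sorted list of sampled suffixes. This is only an illustration under stated assumptions: a sparse suffix array stands in for the sparse suffix tree, the helper names `build_sparse_suffix_array` and `find_at_block_boundaries` are hypothetical, and the query reports only occurrences that start exactly at a block boundary. Finding the remaining occurrences is the subject of the paper and is not reproduced here.

```python
def build_sparse_suffix_array(text: str, block: int) -> list[int]:
    """Start positions of the suffixes beginning at block boundaries
    (0, block, 2*block, ...), sorted by the suffixes they denote.

    Naive construction via sorting full suffixes; fine for a demo only.
    """
    return sorted(range(0, len(text), block), key=lambda i: text[i:])


def find_at_block_boundaries(text: str, sa: list[int], pattern: str) -> list[int]:
    """Occurrences of pattern that start at a sampled (block-boundary) position."""
    # Binary search for the first sampled suffix whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All sampled suffixes that start with the pattern are contiguous from lo.
    hits = []
    while lo < len(sa) and text.startswith(pattern, sa[lo]):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_sparse_suffix_array(text, 2)
    print(sa)
    print(find_at_block_boundaries(text, sa, "abra"))
```

In the example, "abra" also occurs at position 7, which is not a block boundary for block size 2, so the sketch does not report it. The index stores roughly one entry per block instead of one per text position, which is the source of the space saving.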

Information Systems Development Technologies: Proceedings of the International Scientific and Practical Conference

Taganrog : Southern Federal University Publishing House, 2015

The collection is compiled from the materials of the VI International Scientific and Practical Conference "Information Systems Development Technologies", held on September 6-12, 2015 in Gelendzhik. The authors of the published materials are responsible for the authenticity and accuracy of quotations, names, titles, and other information. The materials are published as submitted by the authors. The event was held with the financial support of the Russian Foundation for Basic Research (grant No. 15-07-20559-г). ...

Added: September 13, 2015

Wavelet Trees Meet Suffix Trees

Babenko M. A., Gawrychowski P., Kociumaka T. et al., in: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. San Diego : SIAM, 2015. P. 572-591.

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size ω ≤ n, our method builds the wavelet tree in O(n log ω / √log n) time, improving upon the state-of-the-art algorithm ...

Added: October 4, 2014

Fundamentals of Computation Theory, 22nd International Symposium, FCT 2019, Copenhagen, Denmark, August 12-14, 2019, Proceedings

Springer, 2019

Added: August 4, 2019

Cascade Heap: Towards Time-Optimal Extractions

Babenko M. A., Kolesnichenko I., Smirnov I., Theory of Computing Systems 2019 Vol. 63 No. 4 P. 637-646

Heaps are well-studied fundamental data structures, having myriads of applications, both theoretical and practical. We consider the problem of designing a heap with an “optimal” extract-min operation. Assuming an arbitrary linear ordering of keys, a heap with n elements typically takes O(log n) time to extract the minimum. Extracting all elements faster is impossible as ...

Added: December 6, 2019

Automata Equipped with Auxiliary Data Structures and Regular Realizability Problems

Rubtsov A. A., Vyalyi M., in: Descriptional Complexity of Formal Systems: 23rd IFIP WG 1.02 International Conference, DCFS 2021, Virtual Event, September 5, 2021, Proceedings. Springer, 2021. P. 150-162.

Added: February 2, 2022

Approaches to Organizing the Search Decision Tree in the Branch-and-Bound Method for the Asymmetric Traveling Salesman Problem

Fomichev M., Ulyanov M., Informacionnye Tehnologii (Information Technologies) 2018 Vol. 24 No. 11 P. 698-704

The time efficiency of software implementations of the branch-and-bound method for the asymmetric traveling salesman problem can be improved both by choosing the most suitable data structure, one that provides time-efficient operations on the leaves of the search decision tree, and by using additional memory to store reduced matrices in the leaves of the search decision tree. In addition, one may also propose ...

Added: January 26, 2020

Lecture Notes in Computer Science

Berlin, Heidelberg : Springer, 2017

The 12th issue of LNCS Transactions on Petri Nets and Other Models of Concurrency (ToPNoC) contains revised and extended versions of a selection of the best papers from the workshops held at the 37th International Conference on Application and Theory of Petri Nets and Concurrency (Petri Nets 2016, Toruń, Poland, 19–24 June 2016), and the ...

Added: September 27, 2017

Minimal Discriminating Words Problem Revisited

Kucherov G., Nekrich Y., Gawrychowski P. et al., in: Lecture Notes in Computer Science. Vol. 8214: Proceedings of the 20th Symposium on String Processing and Information Retrieval. Berlin : Springer, 2013. P. 129-140.

We revisit two variants of the problem of computing minimal discriminating words studied in [5]. Given a pattern P and a threshold d, we want to report (i) all shortest extensions of P which occur in less than d documents, and (ii) all shortest extensions of P which occur only in d selected documents. For ...

Added: October 30, 2013

Computing minimal and maximal suffixes of a substring

Babenko M., Gawrychowski P., Kociumaka T. et al., Theoretical Computer Science 2016 Vol. 638 P. 112-121

We consider the problems of computing the maximal and the minimal non-empty suffixes of substrings of a longer text of length n. For the minimal suffix problem we show that for every τ, 1 ≤ τ ≤ log n, there exists a linear-space data structure with O(τ) query time and O(n log n/τ) preprocessing time. As a ...

Added: October 8, 2015

Organizing Fast Search Without an Index

Ponomarenko A., in: Proceedings of the 38th Conference "Information Technologies and Systems - 2014". Nizhny Novgorod : IITP RAS, 2014. P. 194-200.

The classical approach to organizing information for subsequent fast search is to build an index. However, this approach has several drawbacks. The index must be rebuilt and kept up to date, which is difficult for scattered information such as text on the Web. These drawbacks stem from the fact that the index is a reorganized copy of the indexed information. This paper proposes a method ...

Added: September 10, 2014

Cross-document Pattern Matching

Kopelowitz T., Kucherov G., Nekrich Y. et al., Journal of Discrete Algorithms 2013

We study a new variant of the pattern matching problem called cross-document pattern matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear ...

Added: October 30, 2013

Algorithms and Data Structures. WADS 2019. Lecture Notes in Computer Science

Springer, 2019

16th International Symposium, WADS 2019, Edmonton, AB, Canada, August 5–7, 2019, Proceedings ...

Added: October 26, 2021

Resource characteristics of ways to organize a decision tree in the branch-and-bound method for the traveling salesman problem

Ulyanov M.V., Fomichev M.I., Business Informatics 2015 No. 4 (34) P. 38-46

The resource efficiency of different implementations of the branch-and-bound method for the classical traveling salesman problem depends, inter alia, on ways to organize a search decision tree generated by this method. The classic «time-memory» dilemma is realized herein either by an option of storing reduced matrices at the points of the decision tree, which leads ...

Added: November 5, 2016