An AST method for scoring string-to-text similarity in semantic text analysis

B. Mirkin; E. Artemova

Abstract. A suffix-tree-based method for measuring the similarity of a key phrase to an unstructured text is proposed. The measure involves less computation and does not depend on the length of the text or the key phrase. It applies to the following tasks in semantic text analysis:

Finding interrelations between key phrases over a set of texts;
Annotating a research article by topics from a taxonomy of the domain;
Clustering relevant topics and mapping clusters on a domain taxonomy.
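
The abstract does not spell out the scoring formula, so the following is only a minimal sketch of one formulation often used with annotated suffix trees (ASTs): every node stores how many inserted suffixes pass through it, and a key phrase is scored by averaging, over its suffixes, the mean conditional probability along the matched path. The names `Node`, `build_ast` and `score` are hypothetical, and a naive character trie stands in for a compact suffix tree; none of this is confirmed as the authors' exact method.

```python
class Node:
    """Trie node annotated with the number of suffixes passing through it."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_ast(strings):
    """Insert every suffix of every string, incrementing node frequencies.

    Assumption: the text has already been split into short strings
    (for example, a few consecutive words each).
    """
    root = Node()
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def score(root, phrase):
    """Average over suffixes of the mean of freq(child) / freq(parent)."""
    if not phrase or not root.children:
        return 0.0
    # The root has no counter of its own; use the sum over its children.
    root_freq = sum(child.freq for child in root.children.values())
    total = 0.0
    for i in range(len(phrase)):
        node, parent_freq = root, root_freq
        probs = []
        for ch in phrase[i:]:
            child = node.children.get(ch)
            if child is None:
                break  # the match for this suffix ends here
            probs.append(child.freq / parent_freq)
            node, parent_freq = child, child.freq
        if probs:
            total += sum(probs) / len(probs)
    return total / len(phrase)


if __name__ == "__main__":
    tree = build_ast(["suffix tree", "text analysis"])
    print(score(tree, "suffix"))
    print(score(tree, "taxonomy"))
```

Because both averages are normalized, the score always falls between 0 and 1, which makes values for phrases of different lengths comparable in this sketch.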

Language: English

Keywords: suffix tree; unstructured text analysis; string-to-text similarity measures

Publication based on the results of:

Methods for visualizing textual information by constructing suffix trees, multifaceted classifications, and hierarchical ontologies: algorithms and software (2013)

In book

Clusters, orders, trees: methods and applications. In Honor of Boris Mirkin's 70th Birthday

Vol. 92. Berlin: Springer, 2014.

A system for automatic processing of Russian-language texts

Dubov M., Mirkin B., Шаль А. А., Открытые системы. СУБД 2014 No. 10 P. 15–17

Automating text processing and analysis is currently a major trend in IT applications. At present, there is no unified approach to the analysis and visualization of large volumes of text data. Our system LM Monitor (Latent Meaning Monitor) generates so-called reference graphs, which can be considered part of the popular technology of ...

Added: December 16, 2014

A Method for Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources

Artemova E., in: Procedia Computer Science. 2nd International Conference on Information Technology and Quantitative Management, ITQM 2014. National Research University Higher School of Economics (HSE) in Moscow (Russia) on June 3-5, 2014. Vol. 31. Amsterdam: Elsevier, 2014. Ch. 22. P. 193–200.

A two-step approach to taxonomy construction is presented. In the first step, the frame of the taxonomy is built manually from representative educational materials. In the second step, the frame is refined using the Wikipedia category tree and articles. Since the structure of Wikipedia is rather noisy, a procedure to clear the Wikipedia category ...

Added: October 14, 2014

Koznov D., Ledeneva E., Luciv D. et al., Programming and Computer Software 2024 Vol. 50 P. 85–89

Code comments are an essential part of software documentation. Many software projects suffer from low-quality comments that are often produced by copy-paste. For similar methods, classes, etc., copy-pasted comments with minor modifications are justified. However, in many cases this approach leads to degraded documentation quality and, subsequently, to problematic maintenance ...

Added: June 10, 2024

Sublinear Space Algorithms for the Longest Common Substring Problem

Starikovskaya T., Vildhoj H. W., Kociumaka T., in: Algorithms - ESA 2014. 22nd Annual European Symposium, Wrocław, Poland, September 8-10, 2014. Proceedings. Vol. 8737. Berlin: Springer, 2014. P. 605–617.

Given $m$ documents of total length $n$, we consider the problem of finding a longest string common to at least $d \geq 2$ of the documents. This problem is known as the longest common substring (LCS) problem and has a classic $O(n)$ space and $O(n)$ time solution (Weiner [FOCS'73], Hui [CPM'92]). However, the use of ...
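
To make the problem statement above concrete, here is a naive illustration, assuming plain Python strings as documents. It is neither the classic linear-time suffix-tree solution nor the sublinear-space algorithm of the paper: it simply binary-searches the answer length and counts substrings in a dictionary. The function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(docs, d=2):
    """Return a longest string occurring in at least d of the documents.

    Naive sketch: binary search on the length k. This is valid because
    any substring shared by d documents has shorter substrings that are
    shared by the same documents, so feasibility is monotone in k.
    """

    def common_of_length(k):
        counts = {}
        for doc in docs:
            # A set ensures each document counts a substring only once.
            for sub in {doc[i:i + k] for i in range(len(doc) - k + 1)}:
                counts[sub] = counts.get(sub, 0) + 1
        hits = [sub for sub, c in counts.items() if c >= d]
        # Break ties lexicographically so the result is deterministic.
        return min(hits) if hits else None

    lo, hi = 1, max((len(doc) for doc in docs), default=0)
    best = ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_of_length(mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    documents = ["banana", "ananas", "bandana"]
    print(longest_common_substring(documents, d=2))
    print(longest_common_substring(documents, d=3))
```

This sketch stores every substring of the tested length, so its memory use grows well beyond the linear bound quoted above; it is meant only to pin down what is being computed.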

Added: September 2, 2014

Abstracting concepts from text documents by using an ontology

Artemova E., Чугунова О. Н., Аскарова Ю. А. et al., in: CDUD – 2010: International Workshop on Concept Discovery in Unstructured Data. Moscow: Higher School of Economics Publishing House, 2011. P. 20–31.

A method for computationally visualizing and interpreting a text or corpus of texts in a taxonomy of the field is described. The method involves such stages as matching taxonomy topics and text(s) by using annotated suffix trees (ASTs), combining multiple sources of information such as text abstracts, keywords and taxonomy cross-references, building clusters of taxonomy topics and their profiles, and lifting ...

Added: December 27, 2012