Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

M. Dubov

doi:10.1007/978-3-319-26123-2

P. 296–307.

We present an improved implementation of the annotated suffix tree method for text analysis (abbreviated as the AST method). Annotated suffix trees extend the original suffix tree data structure by labeling each node with the occurrence frequency of the corresponding substring in the input text collection. They have a range of applications in text analysis, such as language-independent computation of a matching score for a keyphrase against a text collection. In our enhanced implementation, new algorithms and data structures (suffix arrays in place of the traditional but heavyweight suffix trees) yield an implementation superior to previous ones in both memory consumption (10 times less memory) and runtime. We describe an open-source statistical text analysis software package, called "EAST", which implements this enhanced annotated suffix tree method. The EAST package also includes an adaptation of a distributional synonym extraction algorithm that supports the Russian language and allows us to achieve better results in keyphrase matching.
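The idea of frequency-annotated nodes and a keyphrase matching score can be illustrated with a minimal sketch. It is not the EAST implementation: it uses a naive character-level suffix trie (quadratic in memory) rather than the suffix arrays the paper relies on. The scoring rule, which averages conditional probabilities along each matched path, is an assumed formulation that the abstract does not spell out. All names (`Node`, `build_annotated_trie`, `substring_frequency`, `matching_score`) are hypothetical.

```python
class Node:
    """Trie node annotated with the frequency of the substring it spells."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_annotated_trie(texts):
    """Insert every suffix of every text, counting how often each path is visited."""
    root = Node()
    for text in texts:
        for start in range(len(text)):
            root.freq += 1  # root frequency equals the sum of its children's frequencies
            node = root
            for ch in text[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def substring_frequency(root, s):
    """Return the number of occurrences of s in the collection (0 if absent)."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq


def _suffix_score(root, s):
    """Mean of freq(child) / freq(parent) along the longest matched prefix of s."""
    node, total, depth = root, 0.0, 0
    for ch in s:
        child = node.children.get(ch)
        if child is None:
            break
        total += child.freq / node.freq
        node = child
        depth += 1
    return total / depth if depth else 0.0


def matching_score(root, keyphrase):
    """Assumed scoring rule: average the per-suffix scores over all suffixes."""
    if not keyphrase:
        return 0.0
    scores = [_suffix_score(root, keyphrase[i:]) for i in range(len(keyphrase))]
    return sum(scores) / len(scores)
```

For the single text "abab", `substring_frequency(root, "ab")` returns 2, and a keyphrase that shares no characters with the collection scores 0.0. Because matching works on characters rather than dictionary words, the same code applies unchanged to text in any language.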

Language: English

Keywords: text analysis, algorithms on strings, annotated suffix trees, suffix arrays, synonym extraction

In book

Analysis of Images, Social Networks and Texts. 4th International Conference, AIST 2015, Yekaterinburg, Russia, April 9–11, 2015, Revised Selected Papers

Vol. 542. Series: Communications in Computer and Information Science. Switzerland: Springer, 2015.

Prospects of Media Monitoring in Public Opinion Research (The Case of Trust in the President)

Ankudinov I., Социология: методология, методы, математическое моделирование 2025 No. 61 P. 165–203

The changing political mood of Russians is a constant subject of interest for sociological agencies. With the development of the Internet, conventional questionnaire research began to be supplemented by online surveys and, despite some skepticism, by social media mining. This article attempts to adjust an accidental web-sample so as to bring its estimates closer to ...

Added: April 22, 2026

An Algorithm for Analyzing News Information for Economic Decision-Making

Ramenskaya A., Чудинова О. С., Первицкая Л. А., Индустриальная экономика 2026 No. 1 P. 65–78

This article is devoted to the development of an algorithm for analyzing news information using machine learning methods implemented in Python libraries. The choice of tools used at each stage of the algorithm is justified by calculating metrics for the quality of the solution to the corresponding machine learning problems. The algorithm’s results are presented ...

Added: April 20, 2026

Yusuf-Khoja and His Brothers: On the Kinship of Afanasy Nikitin

Lifshits A., Slovĕne 2025 Vol. 14 No. 1 P. 300–312

The article considers those episodes from the notes of Afanasy Nikitin that allow us to doubt his merchant status. Based on the analysis of grammar, vocabulary and pragmatics of Afanasy’s messages, it is concluded that he traveled along the Volga and further as the head of a small community of people and that he differed ...

Added: September 3, 2025

Semantic Text Analysis Using Artificial Neural Networks Based on Neural-Like Elements with Temporal Signal Summation

Kharlamov Alexander, Eugeny S., Kuznetsov D. et al., Problems of Artificial Intelligence 2023 No. 3(30) P. 4–27

Text as an image is analyzed in the human visual analyzer. In this case, the image is scanned along the points of the greatest informativity, which are the inflections of the contours of the equitextural areas, into which the image is roughly divided. In the case of text analysis, individual characters of the alphabet are ...

Added: October 20, 2024

Use of Text Skeleton Structures for the Development of Semantic Search Methods

A. V. Mylnikova, V. A. Trusov, L. A. Mylnikov, Automatic Documentation and Mathematical Linguistics 2023 Vol. 57 No. 5 P. 301–307

This paper considers the problem of the generation of descriptors to reduce data volumes, text data resources, and search times through the use of the new factors of authorship, region, emotive meaning, and popularity, as well as a text category without special marks that can be used to generate descriptors. This approach allows the use ...

Added: February 29, 2024

Investor sentiment and the NFT hype index: to buy or not to buy?

Baklanova V., Kurkin A., Teplova T., China Finance Review International 2024 Vol. 14 No. 3 P. 522–548

Purpose – The primary objective of this research is to provide a precise interpretation of the constructed machine learning model and produce definitive summaries that can evaluate the influence of investor sentiment on the overall sales of non-fungible token (NFT) assets. To achieve this objective, the NFT hype index was constructed as well as several approaches of ...

Added: December 10, 2023

SmartTips: Online Products Recommendations System Based on Analyzing Customers Reviews

Ali N., Alshahrani A., Alghamdi A. et al., Applied Sciences (Switzerland) 2022 Vol. 12 No. 17 Article 8823

Online customers’ opinions represent a significant resource for both customers and enterprises to extract much information that helps them make the right decision. Finding relevant data while searching the internet is a big challenge for web users, known as the “Problem of Information Overload”. Recommender systems have been recognized as a promising way of solving ...

Added: October 4, 2022

A Semi-automated Pipeline for Mapping the Shifts and Continuities in Media Discourse

Shirokanova A., Silyutina O., in: Digital Transformation and Global Society. 6th International Conference, DTGS 2021, St. Petersburg, Russia, June 23–25, 2021, Revised Selected Papers. Springer, 2022. P. 19–35.

Added: January 27, 2022

Assessing the Quality of Non-Financial Information Disclosure under GRI Standards by Russian Companies

Fedorova E., Khrustova L., Демин И. С., AlterEconomics (formerly Журнал экономической теории) 2020 Vol. 17 No. 2 P. 412–423

The non-financial information is defined as a significant determinant of the company’s activity in terms of many modern theories. The evolution of the company’s investment attractiveness evaluating theory has led to the conclusion that the determining factors include other non-financial characteristics of the company, such as management structure, degree of social and environmental responsibility and ...

Added: October 23, 2021

Text Data Classification Methods: Can the Potential of Quantitative Analysis Be Used in Qualitative Research?

Aleksandrova M., ИНТЕРакция. ИНТЕРвью. ИНТЕРпретация 2021 Vol. 13 No. 2 P. 81–96

Text mining has developed rapidly in recent years. In this article, we compare classification methods that are suitable for solving problems of predicting item nonresponse. The author builds reasoning about how the analysis of textual data can be implemented in a wider research field based on this material. The author considers a number of metrics ...

Added: August 20, 2021

News headline as a form of news text compression

Kochetkova N. A., Pronoza E., Yagunova E., in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10th International Conference on Social Informatics, SocInfo 2018; St. Petersburg. Springer, 2018. P. 139–147.

In this paper we analyze news text collections (clusters) via extracting their paraphrase headlines into a paraphrase graph and working with this graph. Our aim is to test whether news headline is an appropriate form of news text compression. Different types of news collections: dynamic, static and combined (both dynamic and static) clusters are analyzed ...

Added: October 30, 2020

The Tone of Coverage of Russia's Position in English-Language Media during the Sanctions Period

Khrustova L., Федоров Ф. Ю., Fedorova E., Контуры глобальных трансформаций: политика, экономика, право 2020 Vol. 13 No. 4 P. 292–310

The aggravation of the political situation that characterizes the current stage of international relations is accompanied by a large-scale information war. The problem of negative coverage of Russia's position in the international press has been discussed since the early 2000s. The Russian-Ukrainian conflict, which began in late 2013 and early 2014, made foreign media turn their attention to Russia once again and provoked an increase in the number of ...

Added: October 29, 2020

Completeness of Non-Financial Information Disclosure by Russian Companies: Impact on Investment Attractiveness

Khrustova L., Fedorova E., Демин И. С., Российский журнал менеджмента 2020 Vol. 18 No. 1 P. 51–72

In the context of the development of the digital economy, the role of a company’s information transparency has become increasingly important. Alongside purely financial information, investors are more likely to also take into account the disclosure of non-financial information in the annual accounts. The purpose of this study is to empirically examine the relationship between ...

Added: August 20, 2020

DISTRIBUTIONAL AND NETWORK SEMANTICS. TEXT ANALYSIS APPROACHES

Kharlamov A. A., Pantiukhin D., Gordeev D., in: Neuroinformatics and Semantic Representations: Theory and Applications. Cambridge Scholars Publishing, 2020. Ch. 4 P. 55–113.

Abstract. Over the past decade, a new wave of interest in dialogue agents has been observed. This is largely due to the introduction of machine learning in the tasks of automatic natural language processing. Using the tools of distributional and network semantics makes it possible to summarize data from huge corpora of texts. New language ...

Added: June 22, 2020

Application of NLP Algorithms: Automatic Text Classifier Tool

Romanov A., Ekaterina Kozlova, Lomotin Konstantin, in: Digital Transformation and Global Society. Third International Conference, DTGS 2018, St. Petersburg, Russia, 2018, Revised Selected Papers. Part II. Communications in Computer and Information Science 859. Springer, 2018. P. 310–323.

This research is dedicated to the design of a decision support system for categorization of scientific literature. The purpose of this work is to research possible ways to apply the machine learning algorithms to the automation of manual text categorization. The following stages are considered: preprocessing of raw data, word embedding, model selection, classification model, ...

Added: August 26, 2019

Using Domain Taxonomy to Model Generalization of Thematic Fuzzy Clusters

Frolov D., Mirkin B., Nascimento S. et al., in: CONTENT 2019, The Eleventh International Conference on Creative Content Technologies. International Academy, Research, and Industry Association (IARIA), 2019. P. 20–25.

We define a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a domain taxonomy. This generalization lifts the set to its 'head subject' in the higher ranks of the taxonomy tree. The head subject is supposed to 'tightly' cover the query set, possibly bringing in some ...

Added: June 4, 2019

Success Factors of Electronic Petitions at Russian Public Initiative Project: The Role of Informativeness, Topic and Lexical Information

Porshnev A., in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10th International Conference on Social Informatics, SocInfo 2018; St. Petersburg. Springer, 2018. P. 243–250.

Online petitions are usually regarded as one of the most popular channels to involve citizens in the political process. In our paper we have analyzed texts and voting data (for and against) from 9705 e-petitions submitted from 2013 until 2017 at the Russian Public Initiative project. Analysis of dynamics showed stabilization of interest in this resource (emergence of a new ...

Added: February 12, 2019

A Research Project as a Tool for Teaching Text Analysis Methods: Predicting the Class of a Social Network Post

Suvorova A., Смирнова К. Р., Будин Е. А. et al., Компьютерные инструменты в образовании 2018 No. 3 P. 49–64

The article describes a student research project on predicting the class of a post on a social network based on its textual content. The features of the project are discussed as an integral part of the trajectory of teaching data analysis methods, including text analysis methods and tools that are often not included in machine ...

Added: January 28, 2019

The Impact of the Tone of CEO Letters on a Company's Financial Performance

Fedorova E., Осетров Р. А., Демин И. С. et al., Российский журнал менеджмента 2017 Vol. 15 No. 4 P. 441–462

The paper is devoted to the analysis of CEO letters as an instrument for influencing the expectations of shareholders and potential investors. The aim of the research is to analyze empirically the influence of semantic characteristics of CEO letters on financial indicators of the company. The authors suggested that CEO letter’s tonality, its length and ...

Added: October 23, 2018

Digital Humanities in the History of Psychology (The Case of the Surname of V. M. Bekhterev)

Костригин А. А., Khusyainov T., Цифровой ученый: лаборатория философа 2018 Vol. 1 No. 1 P. 160–179

The article discusses the problems and prospects for using the methodology of Digital Humanities in historical psychological studies. The authors present the results of the search and analysis of the mentions found for the name of the outstanding psychophysiologist, psychoneurologist, and psychologist Vladimir M. Bekhterev (1857-1927) in the body of texts in the Google ...

Added: April 3, 2018

Texterra: A framework for text analysis.

S.D. Kuznetsov, D.Yu. Turdakov, Астраханцев Н. А. et al., Programming and Computer Software 2014 Vol. 40 No. 5 P. 288–295

A framework for fast text analysis, which is developed as a part of the Texterra project, is described. Texterra provides a scalable solution for the fast text processing on the basis of novel methods that exploit knowledge extracted from the Web and text documents. For the developed tools, details of the project, use cases, and ...

Added: November 26, 2017