Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

P. 296–307.
Dubov M.

We present an improved implementation of the annotated suffix tree method for text analysis (abbreviated as the AST-method). Annotated suffix trees extend the original suffix tree data structure: each node is labeled with the occurrence frequency of the corresponding substring in the input text collection. They have a range of applications in text analysis, such as language-independent computation of a matching score for a keyphrase against a text collection. Our enhanced implementation uses new algorithms and data structures (suffix arrays instead of the traditional but heavyweight suffix trees) and outperforms the previous implementations in both memory consumption (10 times less memory) and runtime. We describe an open-source statistical text analysis software package called "EAST", which implements this enhanced annotated suffix tree method. The EAST package also includes an adaptation of a distributional synonym extraction algorithm that supports the Russian language and allows us to achieve better results in keyphrase matching.
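The idea of frequency-annotated nodes and a keyphrase matching score can be illustrated with a minimal sketch. It is not the EAST implementation: it uses a naive, uncompressed suffix trie rather than the suffix arrays the abstract mentions, and the scoring rule (averaging conditional frequencies along each matched path, then averaging over the keyphrase's suffixes) is an assumption made for illustration. The names `Node`, `build_annotated_suffix_trie`, and `match_score` are hypothetical.

```python
class Node:
    """Trie node annotated with the occurrence count of its substring."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_annotated_suffix_trie(strings):
    """Insert every suffix of every string and count the visits to each node.

    After construction, the node reached by spelling a substring w holds the
    number of occurrences of w in the collection; the root holds the total
    number of inserted suffixes.
    """
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            node.count += 1
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def match_score(root, phrase):
    """Assumed scoring rule, for illustration only.

    For each suffix of the phrase, follow the longest matching path from the
    root and average count(child) / count(parent) along it; then average the
    results over all suffixes of the phrase.
    """
    if not phrase:
        return 0.0
    total = 0.0
    for start in range(len(phrase)):
        node, probs = root, []
        for ch in phrase[start:]:
            child = node.children.get(ch)
            if child is None:
                break
            probs.append(child.count / node.count)
            node = child
        if probs:
            total += sum(probs) / len(probs)
    return total / len(phrase)


trie = build_annotated_suffix_trie(["abab"])
print(match_score(trie, "ab"))  # 0.625
print(match_score(trie, "zz"))  # 0.0
```

The naive trie takes time and memory quadratic in the length of each input string, which is the kind of overhead that compact structures such as suffix arrays are meant to reduce.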

Language: English
Keywords: text analysis, algorithms on strings, annotated suffix trees, suffix arrays, synonym extraction

In book

Analysis of Images, Social Networks and Texts. 4th International Conference, AIST 2015, Yekaterinburg, Russia, April 9–11, 2015, Revised Selected Papers
Communications in Computer and Information Science, Vol. 542. Switzerland: Springer, 2015.
Similar publications
Prospects for Media Monitoring in Public Opinion Research (The Case of Trust in the President)
Ankudinov I., Социология: методология, методы, математическое моделирование 2025 No. 61 P. 165–203
The changing political mood of Russians is a constant subject of interest for sociological agencies. With the development of the Internet, conventional questionnaire research began to be supplemented by online surveys and, despite some skepticism, by social media mining. This article attempts to adjust an accidental web-sample so as to bring its estimates closer to ...
Added: April 22, 2026
An Algorithm for Analyzing News Information for Economic Decision-Making
Ramenskaya A., Чудинова О. С., Первицкая Л. А., Индустриальная экономика 2026 No. 1 P. 65–78
This article is devoted to the development of an algorithm for analyzing news information using machine learning methods implemented in Python libraries. The choice of tools used at each stage of the algorithm is justified by calculating metrics for the quality of the solution to the corresponding machine learning problems. The algorithm’s results are presented ...
Added: April 20, 2026
Yusuf-Khoja and His Brothers: On the Kinship of Afanasy Nikitin
Lifshits A., Slovĕne 2025 Vol. 14 No. 1 P. 300–312
The article considers those episodes from the notes of Afanasy Nikitin that allow us to doubt his merchant status. Based on the analysis of grammar, vocabulary and pragmatics of Afanasy’s messages, it is concluded that he traveled along the Volga and further as the head of a small community of people and that he differed ...
Added: September 3, 2025
Semantic Text Analysis Using Artificial Neural Networks Based on Neural-Like Elements with Temporal Signal Summation
Kharlamov Alexander, Eugeny S., Kuznetsov D. et al., Problems of Artificial Intelligence 2023 No. 3(30) P. 4–27
Text as an image is analyzed in the human visual analyzer. In this case, the image is scanned along the points of the greatest informativity, which are the inflections of the contours of the equitextural areas, into which the image is roughly divided. In the case of text analysis, individual characters of the alphabet are ...
Added: October 20, 2024
Use of Text Skeleton Structures for the Development of Semantic Search Methods
A. V. Mylnikova, V. A. Trusov, L. A. Mylnikov, Automatic Documentation and Mathematical Linguistics 2023 Vol. 57 No. 5 P. 301–307
This paper considers the problem of the generation of descriptors to reduce data volumes, text data resources, and search times through the use of the new factors of authorship, region, emotive meaning, and popularity, as well as a text category without special marks that can be used to generate descriptors. This approach allows the use ...
Added: February 29, 2024
Investor sentiment and the NFT hype index: to buy or not to buy?
Baklanova V., Kurkin A., Teplova T., China Finance Review International 2024 Vol. 14 No. 3 P. 522–548
Purpose – The primary objective of this research is to provide a precise interpretation of the constructed machine learning model and produce definitive summaries that can evaluate the influence of investor sentiment on the overall sales of non-fungible token (NFT) assets. To achieve this objective, the NFT hype index was constructed as well as several approaches of ...
Added: December 10, 2023
SmartTips: Online Products Recommendations System Based on Analyzing Customers Reviews
Ali N., Alshahrani A., Alghamdi A. et al., Applied Sciences (Switzerland) 2022 Vol. 12 No. 17 Article 8823
Online customers’ opinions represent a significant resource for both customers and enterprises to extract much information that helps them make the right decision. Finding relevant data while searching the internet is a big challenge for web users, known as the “Problem of Information Overload”. Recommender systems have been recognized as a promising way of solving ...
Added: October 4, 2022
A Semi-automated Pipeline for Mapping the Shifts and Continuities in Media Discourse
Shirokanova A., Silyutina O., in: Digital Transformation and Global Society. 6th International Conference, DTGS 2021, St. Petersburg, Russia, June 23–25, 2021, Revised Selected Papers. Springer, 2022. P. 19–35.
Added: January 27, 2022
Assessing the Quality of Non-Financial Information Disclosure under GRI Standards by Russian Companies
Fedorova E., Khrustova L., Демин И. С., AlterEconomics (formerly Журнал экономической теории) 2020 Vol. 17 No. 2 P. 412–423
The non-financial information is defined as a significant determinant of the company’s activity in terms of many modern theories. The evolution of the company’s investment attractiveness evaluating theory has led to the conclusion that the determining factors include other non-financial characteristics of the company, such as management structure, degree of social and environmental responsibility and ...
Added: October 23, 2021
Text Data Classification Methods: Can the Potential of Quantitative Analysis Be Used in Qualitative Research?
Aleksandrova M., ИНТЕРакция. ИНТЕРвью. ИНТЕРпретация 2021 Vol. 13 No. 2 P. 81–96
Text mining has developed rapidly in recent years. In this article, we compare classification methods that are suitable for solving problems of predicting item nonresponse. The author builds reasoning about how the analysis of textual data can be implemented in a wider research field based on this material. The author considers a number of metrics ...
Added: August 20, 2021
News headline as a form of news text compression
Kochetkova N. A., Pronoza E., Yagunova E., in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10th International Conference on Social Informatics, SocInfo 2018; St. Petersburg. Springer, 2018. P. 139–147.
In this paper we analyze news text collections (clusters) via extracting their paraphrase headlines into a paraphrase graph and working with this graph. Our aim is to test whether news headline is an appropriate form of news text compression. Different types of news collections: dynamic, static and combined (both dynamic and static) clusters are analyzed ...
Added: October 30, 2020
The Tone of Coverage of Russia's Position in English-Language Media during the Sanctions Period
Khrustova L., Федоров Ф. Ю., Fedorova E., Контуры глобальных трансформаций: политика, экономика, право 2020 Vol. 13 No. 4 P. 292–310
The escalation of the political situation that characterizes the current stage of international relations is accompanied by a large-scale information war. The problem of negative coverage of Russia's position in the international press has been discussed since the early 2000s. The Russian-Ukrainian conflict, which began in late 2013 and early 2014, made foreign media turn their attention to Russia once again and provoked an increase in the number of ...
Added: October 29, 2020
Completeness of Non-Financial Information Disclosure by Russian Companies: Impact on Investment Attractiveness
Khrustova L., Fedorova E., Демин И. С., Российский журнал менеджмента 2020 Vol. 18 No. 1 P. 51–72
In the context of the development of the digital economy, the role of a company’s information transparency has become increasingly important. Alongside purely financial information, investors are more likely to also take into account the disclosure of non-financial information in the annual accounts. The purpose of this study is to empirically examine the relationship between ...
Added: August 20, 2020
DISTRIBUTIONAL AND NETWORK SEMANTICS. TEXT ANALYSIS APPROACHES
Kharlamov A. A., Pantiukhin D., Gordeev D., in: Neuroinformatics and Semantic Representations: Theory and Applications. Cambridge Scholars Publishing, 2020. Ch. 4 P. 55–113.
Abstract. Over the past decade, a new wave of interest in dialogue agents has been observed. This is largely due to the introduction of machine learning in the tasks of automatic natural language processing. Using the tools of distributional and network semantics makes it possible to summarize data from huge corpora of texts. New language ...
Added: June 22, 2020
Application of NLP Algorithms: Automatic Text Classifier Tool
Romanov A., Ekaterina Kozlova, Lomotin Konstantin, in: Digital Transformation and Global Society. Third International Conference, DTGS 2018, St. Petersburg, Russia, 2018, Revised Selected Papers. Part II. Communications in Computer and Information Science, Issue 859. Springer, 2018. P. 310–323.
This research is dedicated to the design of a decision support system for categorization of scientific literature. The purpose of this work is to research possible ways to apply the machine learning algorithms to the automation of manual text categorization. The following stages are considered: preprocessing of raw data, word embedding, model selection, classification model, ...
Added: August 26, 2019
Using Domain Taxonomy to Model Generalization of Thematic Fuzzy Clusters
Frolov D., Mirkin B., Nascimento S. et al., in: CONTENT 2019, The Eleventh International Conference on Creative Content Technologies. International Academy, Research, and Industry Association (IARIA), 2019. P. 20–25.
We define a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a domain taxonomy. This generalization lifts the set to its 'head subject' in the higher ranks of the taxonomy tree. The head subject is supposed to 'tightly' cover the query set, possibly bringing in some ...
Added: June 4, 2019
Success Factors of Electronic Petitions at Russian Public Initiative Project: The Role of Informativeness, Topic and Lexical Information
Porshnev A., in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10th International Conference on Social Informatics, SocInfo 2018; St. Petersburg. Springer, 2018. P. 243–250.
Online petitions are usually regarded as one of the most popular channels to involve citizens in the political process. In our paper we have analyzed texts and voting data (pro and against) from 9705 e-petitions submitted from 2013 until 2017 at Russian Public Initiative project. Analysis of dynamics showed stabilization of interest to this resource (emergence of a new ...
Added: February 12, 2019
A Research Project as a Tool for Teaching Text Analysis Methods: Predicting the Class of a Social Network Post
Suvorova A., Смирнова К. Р., Будин Е. А. et al., Компьютерные инструменты в образовании 2018 No. 3 P. 49–64
The article describes a student research project on predicting the class of a post on a social network based on its textual content. The features of the project are discussed as an integral part of the trajectory of teaching data analysis methods, including text analysis methods and tools that are often not included in machine ...
Added: January 28, 2019
The Impact of CEO Letter Tone on Company Financial Performance
Fedorova E., Осетров Р. А., Демин И. С. et al., Российский журнал менеджмента 2017 Vol. 15 No. 4 P. 441–462
The paper is devoted to the analysis of CEO letters as an instrument for influencing the expectations of shareholders and potential investors. The aim of the research is to analyze empirically the influence of semantic characteristics of CEO letters on financial indicators of the company. The authors suggested that CEO letter’s tonality, its length and ...
Added: October 23, 2018
Digital Humanities in the History of Psychology (The Case of the Surname of V.M. Bekhterev)
Костригин А. А., Khusyainov T., Цифровой ученый: лаборатория философа 2018 Vol. 1 No. 1 P. 160–179
The article discusses the problems and prospects for using the methodology of Digital Humanities in historical psychological studies. The authors present the results of the search and analysis of the mentions found for the name of the outstanding psychophysiologist, psychoneurologist, and psychologist Vladimir M. Bekhterev (1857-1927) in the body of texts in the Google ...
Added: April 3, 2018
Texterra: A framework for text analysis.
S.D. Kuznetsov, D.Yu. Turdakov, Астраханцев Н. А. et al., Programming and Computer Software 2014 Vol. 40 No. 5 P. 288–295
A framework for fast text analysis, which is developed as a part of the Texterra project, is described. Texterra provides a scalable solution for the fast text processing on the basis of novel methods that exploit knowledge extracted from the Web and text documents. For the developed tools, details of the project, use cases, and ...
Added: November 26, 2017