Annotated suffix trees for text clustering

P. 25–31.
Artemova E., Ilvovsky D.

This paper describes an extension of tf-idf weighting to the annotated suffix tree (AST) structure. The new weighting scheme can be used to compute similarity between texts, which can then serve as input to a clustering algorithm. We present preliminary tests of using ASTs to compute the similarity of Russian texts and show a slight improvement over the baseline cosine similarity after applying a spectral clustering algorithm.
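For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of plain tf-idf cosine similarity between texts, not the authors' AST-based weighting, which this page does not specify. The whitespace tokenizer, the smoothed idf formula, and the function names (`tfidf_vectors`, `cosine`, `similarity_matrix`) are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumption: whitespace tokenization and a smoothed idf,
    log((1 + N) / (1 + df)) + 1, chosen here for illustration only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Pairwise cosine similarities between the tf-idf vectors of docs."""
    vectors = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

A matrix like the one returned by `similarity_matrix` is the kind of input a spectral clustering algorithm can take as a precomputed affinity; the clustering step itself is left out of this sketch.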

Language: English
Keywords: clustering, similarity measures, annotated suffix tree
Publication based on the results of:
Mining Data with Complex Structure and Semantic Technologies (2016)

In book

The 3d International Workshop on Concept Discovery in Unstructured Data (CDUD 2016). Proceedings of the Third Workshop on Concept Discovery in Unstructured Data co-located with the 13th International Conference on Concept Lattices and Their Applications (CLA 2016), Moscow, Russia, July 18, 2016. CEUR Workshop Proceedings
Vol. 1625. Aachen: CEUR Workshop Proceedings, 2016.
Similar publications
Flexible Stock Market Algorithm
Rubchinskiy A., Chubarova D., Technology and Investment 2025 Vol. 16 No. 4 P. 211–240
The article considers one of the most famous examples of socio-economic systems characterized by significant uncertainty: the S&P-500 stock market, where shares of the 500 largest US companies are traded. A flexible algorithm for daily trading has been developed. It is based on known fixed data about the cost of shares in previous days as well as on ...
Added: December 19, 2025
Tunnel Clustering Method
F. T. Aleskerov, A. L. Myachin, V. I. Yakuba, Doklady Mathematics 2024 Vol. 110 No. 3 P. 474–479
We propose a novel method for rapid pattern analysis of high-dimensional numerical data, termed tunnel clustering. The main advantages of the method are its relatively low computational complexity, endogenous determination of cluster composition and number, and a high degree of interpretability of final results. We present descriptions of three different variations: one with fixed hyperparameters, ...
Added: March 3, 2025
Using Z-numbers to describe a data set
Гусейнов О., Degtyarev K. Y., IRETC MTÜ PAHTEI - Proceedings of Azerbaijan High Technical Educational Institutions 2025 Vol. 48 No. 1 P. 360–370
The concept of Z-number was proposed by Prof. Lotfi Zadeh to describe partial reliability of information, and it is a kind of fusion of fuzziness and probabilistic uncertainty. Z-number can be presented as a pair of fuzzy numbers Z(A,B) used to describe a value of a random variable X. The first component (A) is a ...
Added: February 20, 2025
Gradient descent clustering with regularization to recover communities in transformed attributed networks
Shalileh S., Social Network Analysis and Mining 2025 Vol. 15212 P. 137–148
Community detection in attributed networks aims to recover clusters in which the within-community nodes are as interconnected and as homogeneous as possible, while the between-communities nodes are as disconnected and as heterogeneous as possible. The current research proposes a straightforward data-driven model with an integrated regularization term to recover communities. For further improvement of the ...
Added: November 30, 2024
An empirical scrutinization of four crisp clustering methods with four distance metrics and one straightforward interpretation rule
T. A. Alvandyan, S. Shalileh, Doklady Mathematics 2024 Vol. 110 No. S1 P. S236–S250
Clustering has always been in great demand by scientific and industrial communities. However, due to the lack of ground truth, interpreting its obtained results can be debatable. The current research provides an empirical benchmark on the efficiency of three popular and one recently proposed crisp clustering methods. To this end, we extensively analyzed these (four) ...
Added: November 30, 2024
Modelling teachers' pay under heterogeneous socio-economic conditions across regions
Богданова Т. К., Жукова Л. В., in: 11th International Conference "Multivariate Statistical Analysis, Econometrics and Modelling of Real Processes" named after S. A. Aivazian. Moscow: CEMI RAS, 2024. P. 41–44.
The paper is devoted to the analysis and forecasting of the average salary of teachers. For 84 regions on the basis of their socio-demographic characteristics according to Rosstat data using Ward's method we obtained a two-cluster solution, which allowed us to identify quite strong differences in the level of wages, GRP per capita, level of ...
Added: October 4, 2024
Threshold Functions and Operations in the Theory of Evidence
Lepskiy A., in: Belief Functions: Theory and Applications: 8th International Conference, BELIEF 2024, Belfast, UK, September 2–4, 2024, Proceedings. Vol. 14909: Lecture Notes in Computer Science. Cham: Springer, 2024. Ch. 23 P. 216–224.
The article introduces and discusses threshold belief and plausibility functions. When forming such functions, only focal elements that are “significant” for a given set are taken into account. The significance of focal elements is determined using a similarity measure and a threshold. Threshold functionals of uncertainty, external and internal conflicts, threshold rules of combination are ...
Added: September 14, 2024
Aggregation and Ranking on an Ordinal Scale Using Threshold Evidential Combination Rules
Lepskiy A., Procedia Computer Science 2024 Vol. 242 P. 444–451
A new method of aggregation and ranking on an ordinal scale is proposed based on the method of evidential ranking previously developed by the author, but using the tools of threshold aggregation of bodies of evidence. This method has better robustness and stability compared to the threshold-free method. The method allows you to take into ...
Added: September 14, 2024
Clustering with empty clusters
Penikas H. I., Феста Ю. Ю., Известия Дальневосточного федерального университета. Экономика и управление 2024 Vol. 2 P. 75–94
Cluster analysis is widely used in various scientific and practical fields related to data analysis. It is an important tool for solving problems in areas such as machine learning, image processing, text recognition, etc. The absence of observations does not always mean the absence of information, so it is assumed that gaps in the data, that is, "empty" clusters, also carry information about the object of study, just as real observations do. This study assumes ...
Added: August 10, 2024
Detecting linguistic variation with geographic sampling
Koile E., Moroz G., Journal of Linguistic Geography 2024 Vol. 12 No. 1 P. 24–31
Geolectal variation is often present in settings where one language is spoken across a vast geographic area. This can be found in phonological, morphosyntactic, and lexical features. For practical reasons, it is not always possible to conduct fieldwork in every single location of interest in order to obtain the full pattern of variation, and a ...
Added: May 6, 2024
Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques
Gromov V., Dang Q. N., in: 10th International Conference, PReMI 2023, Kolkata, India, December 12–15, 2023, Proceedings. Pattern Recognition and Machine Intelligence. LNCS, volume 14301. Cham: Springer, 2023. Ch. 3 P. 20–27.
Added: November 29, 2023
Temperature-driven transition into vortex clusters in low-kappa intertype superconductors
Backs A., Al-Falou A., Vagov A. et al., Physical Review B: Condensed Matter and Materials Physics 2023 Vol. 107 No. 17 Article 174527
In the vicinity of the type-I/type-II crossover in conventional superconductors, vortices exhibit a nonmonotonic interaction, which leads to exotic vortex matter states. We perform molecular dynamics simulations on a model superconductor in the intertype regime. In a field cooled approach, we examine the transition of a homogeneous vortex lattice (VL) into a structure consisting of ...
Added: November 2, 2023
Company name matching using job market data enrichment
Andrei A. Ternikov, IT Professional 2024 Vol. 26 No. 2 P. 76–82
This article contributes to the field of matching techniques by introducing a new algorithm based on labor market data enrichment. This approach is able to collect and balance the training and test samples for data integration purposes. By setting thresholds for textual matching and geographic proximity, it simplifies the process of finding suitable company matches. ...
Added: October 26, 2023
2023 Fifth International Conference Neurotechnologies and Neurointerfaces (CNN) 18-20 Sept. 2023
Alshanskaia E., Martynova O., IEEE, 2023.
Cognitive and emotional load in the course of increasing the complexity of tasks leads to the activation of various parts of the autonomic nervous system (ANS) and can be accompanied by an increase in the efficiency of problem solving. An increase in cognitive load under the condition of high motivation is a stress factor and ...
Added: September 24, 2023
A new software platform for modelling traffic flows involving unmanned vehicles
Beklaryan A., Вестник ЦЭМИ 2023 Vol. 6 No. 1 Article 5
The article presents a new software platform for modelling traffic flows involving unmanned vehicles, using a number of advanced technological solutions, in particular, the FLAME GPU supercomputer agent modelling framework, intelligent software modules based on fuzzy and hierarchical clustering, genetic optimization algorithms, a subsystem for visualizing the state of agents-vehicles based on OpenGL, etc. As ...
Added: June 4, 2023
Tracing Vortex Clustering in a Superconductor by the Magnetic Flux Distribution
A. Vagov, E. G. Nikonov, The Journal of Physical Chemistry Letters 2023 Vol. 14 No. 15 P. 3743–3748
By investigating spatial configurations of the intermediate mixed state in an intertype superconductor, it is shown that vortex clustering can be characterized by the sample averaged distribution of the penetrating magnetic field. The clustering is manifested in the two peak structure of the distribution. The second peak indicates a spot a material occupies in the ...
Added: June 2, 2023
An empirical comparison of connectivity-based distances on a graph and their computational scalability
Miasnikof P., Shestopaloff A., Pitsoulis L. et al., Journal of Complex Networks 2022 Vol. 10 No. 1 Article cnac003
In this study, we compare distance measures with respect to their ability to capture vertex community structure and the scalability of their computation. Our goal is to find a distance measure which can be used in an aggregate pairwise minimization clustering scheme. The minimization should lead to subsets of vertices with high induced subgraph density. ...
Added: November 21, 2022
Noise clustering as a method for assessing permanent vascular access function in hemodialysis patients
Кравцов П. Ф., Николаев Е. Н., Мазайшвили К. В. et al., Вестник СурГУ. Медицина 2022 Vol. 51 No. 1 P. 25–30
Abstract. The study aims to develop an algorithm for assessing spectrographic features of arteriovenous fistula dysfunction for hemodialysis. Materials and methods. Forty-four patients with native radiocephalic fistula formed in the distal third of the forearm participated in the research. Using electronic stethoscope, the noise of arteriovenous fistula was recorded in all patients. 653 spectrograms were analyzed with the ...
Added: November 14, 2022