Annotated suffix trees for text clustering

E. Artemova; D. Ilvovsky

P. 25-31.

This paper describes an extension of tf-idf weighting to the annotated suffix tree (AST) structure. The new weighting scheme can be used to compute similarity between texts, which can in turn serve as input to a clustering algorithm. We present preliminary tests of using the AST to compute the similarity of Russian texts and show a slight improvement over the baseline cosine similarity after applying a spectral clustering algorithm.
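
The baseline mentioned in the abstract, cosine similarity between texts, can be illustrated with a minimal sketch. This is not the authors' implementation and does not include the AST-based weighting. The whitespace tokenization, the smoothed idf formula, and the function names `tfidf_vectors`, `cosine`, and `similarity_matrix` are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse tf-idf vector (dict: term -> weight) per text."""
    # Assumption: lowercase whitespace tokenization; real Russian texts
    # would normally be lemmatized first.
    docs = [Counter(text.lower().split()) for text in texts]
    n_docs = len(docs)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Assumption: smoothed idf, log((1 + N) / (1 + df)) + 1.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(texts):
    """Pairwise cosine similarities between all texts."""
    vectors = tfidf_vectors(texts)
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

A matrix of this kind could then be handed to a spectral clustering implementation that accepts a precomputed affinity matrix, for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`.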

Language: English

Keywords: clustering; similarity measures; annotated suffix tree

Publication based on the results of:

Mining Data with Complex Structure and Semantic Technologies (2016)

In book

The 3d International Workshop on Concept Discovery in Unstructured Data (CDUD 2016). Proceedings of the Third Workshop on Concept Discovery in Unstructured Data co-located with the 13th International Conference on Concept Lattices and Their Applications (CLA 2016), Moscow, Russia, July 18, 2016. CEUR Workshop Proceedings

Vol. 1625. Aachen : CEUR Workshop Proceedings, 2016

Analysis and interpretation of imaging mass spectrometry data by clustering mass-to-charge images according to their spatial similarity

Alexandrov T., Chernyavsky I., Becker M. et al., Analytical Chemistry 2013 Vol. 85 No. 23 P. 11189-11195

Imaging mass spectrometry (imaging MS) has emerged in the past decade as a label-free, spatially resolved, and multipurpose bioanalytical technique for direct analysis of biological samples from animal tissue, plant tissue, biofilms, and polymer films. Imaging MS has been successfully incorporated into many biomedical pipelines where it is usually applied in the so-called untargeted mode, capturing spatial localization of a multitude of ions ...

Added: November 18, 2013

Some thoughts on using annotated suffix trees for Natural Language Processing

Artemova E., in: 2nd Workshop on Interactions Between Data Mining and Natural Language Processing, DMNLP 2015; Porto; Portugal; 7 September 2015. Issue 1410. Aachen : CEUR-WS, 2015. P. 5-18.

The paper defines an annotated suffix tree (AST) - a data structure used to calculate and store the frequencies of all the fragments of the given string or a collection of strings. The AST is associated with a string to text scoring, which takes all fuzzy matches into account. We show how the AST and ...

Added: October 8, 2015

A Hybrid Approach to the Analysis of a Collection of Research Papers

Mirkin B., Frolov D., Vlasov A. et al., in: Intelligent Data Engineering and Automated Learning – IDEAL 2020/ 21st International Conference, Guimaraes, Portugal, November 4–6, 2020, Proceedings, Part II. Vol. 12490: Lecture Notes in Computer Science. Cham : Springer, 2020. P. 423-433.

We define and find a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a taxonomy. This generalization lifts the set to a “head subject” in the higher ranks of the taxonomy, that is supposed to “tightly” cover the query set, possibly bringing in some errors, both ...

Added: November 13, 2020

A Hybrid Approach to Interpretable Analysis of Research Paper Collections

Mirkin B., Frolov D., Vlasov A. et al., in: WIMS 2020: Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics. Association for Computing Machinery (ACM), 2020. P. 184-189.

Added: August 28, 2020

Clustering of Biomedical Data Using the Greedy Clustering Algorithm Based on Interval Pattern Concepts

Galatenko A. V., Nersisyan S., Pankratieva V., in: Proceedings of the International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI at IJCAI/ECAI 2019). [s.n.], 2019. P. 65-74.

Interval pattern concepts are a particular case of pattern structures. They can be used to cluster rows of a numerical formal context (data matrix): two rows are close to each other if their entries at the corresponding positions fall within a given interval. The problem of mining interval pattern concepts has much in common with the known problem related to ...

Added: April 28, 2020

Clustering cities based on their development dynamics and Variable neighborhood search

B. S. Zhikharevich, Electronic Notes in Discrete Mathematics 2015 No. 47 P. 213-220

Clustering cities based on their socio-economic development over a long time period is an important issue and may be used in many ways, e.g., in strategic regional planning. In this paper we continue our recent study, in which a cumulative attribute for each year, called the ’vector of dynamics’, replaces nine other attributes. In our previous paper an original ranking method was proposed. ...

Added: November 12, 2015

Russian Nationalist Movement Restructuring in light of the Ukrainian Events which took place in 2013-14

Rotmistrov A., / Social Science Research Network. Series SSRN Working Paper Series "SSRN Working Paper Series". 2015.

The events in Ukraine in 2013-2014 attracted the Russian society’s attention and affected the Russian political agenda. One of the most affected sectors of the Russian domestic policy was Russian nationalist organizations. The issue of radical nationalism has become essential for European countries and for Russia in particular. But this object is rather difficult to ...

Added: October 15, 2015

Ignatov D. I., Sarwar S. M., Hasan M. et al., in: Analysis of Images, Social Networks and Texts. 4th International Conference, AIST 2015, Yekaterinburg, Russia, April 9–11, 2015, Revised Selected Papers. Vol. 542: Series: Communications in Computer and Information Science. Switzerland : Springer, 2015.

In this paper we show how several similarity measures can be combined for finding similarity between a pair of users for performing Collaborative Filtering in Recommender Systems. Through aggregation of several measures we find super similar and super dissimilar user pairs and assign a different similarity value for these types of pairs. We also introduce ...

Added: November 24, 2015

Clustering of agents in the bounded neighbourhood model

Akopov A. S., Beklaryan A., Искусственные общества 2020 Vol. 15 No. 3 P. 1-11

This article presents a new approach to designing agent-based bounded neighbourhood models (Schelling’s models). An original agent-based model in the AnyLogic system has been developed, which describes the segregation processes caused by the behaviour patterns of agent-individuals. Various scenarios (environment characteristics) affecting the cluster structure of the spatial distribution of agents are examined. Using the proposed bounded ...

Added: September 14, 2020

The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning

Mirkin B., Amorim R., Makarenkov V. et al., Pattern Recognition 2017 Vol. 67 P. 62-72

The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the ...

Added: March 30, 2017

Comparison of String Similarity Measures for Obscenity Filtering

Artemova E., in: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Stroudsburg, PA : Association for Computational Linguistics, 2017. P. 97-101.

In this paper we address the problem of filtering obscene lexis in Russian texts. We use string similarity measures to find words similar or identical to words from a stop list and establish both a test collection and a baseline for the task. Our experiments show that a novel string similarity measure based ...

Added: October 10, 2017

Usage of Clustering of Paley Graphs in Polar Coordinates for the Development of New Network on Chip Topologies

Alijon F. Fatullaev, Edward R. Rzaev, Aleksandr Yu. Romanov, in: 2022 International Russian Automation Conference (RusAutoCon). IEEE, 2022. P. 419-423.

The article presents a study of clustering of Paley graphs with the arrangement of prime numbers in polar coordinates and a comparison of the resulting groups in terms of their static parameters; the application of fault-tolerant self-organizing routing method for new topologies is also considered. This article is a continuation of a series of articles ...

Added: October 2, 2022

Processing and analysis of monitoring results for managing the formation of conditions for quality education

Shvindt A., Моделирование, оптимизация и информационные технологии 2017 Vol. 5 No. 4 P. 1-18

The article reviews models and procedures for processing and evaluating monitoring results, including student participation, focused on intellectual support of administrative managerial decisions when developing the conditions and corresponding resources needed to meet applicable regulatory requirements for the quality of university education. The first stage of processing is normalization of factors which characterize ...

Added: August 19, 2019

Preface to the special issue on “Clustering and search techniques in large scale networks”

Pardalos P. M., Kalyagin V. A., Optimization Letters 2017 Vol. 11 No. 2 P. 247-247

Clustering and search techniques are essential to a wide spectrum of applications. Network clustering techniques are becoming common in the analysis of massive data sets arising in various branches of science, engineering, government and industry. In particular, network clustering and search techniques emerge as an important tool in large-scale networks. This special issue of Optimization Letters ...

Added: September 12, 2016

Conceptual maps: construction over a text collection and analysis

E. Morenko, Artemova E., Mirkin B., in: Analysis of Images, Social Networks and Texts. Third International Conference, AIST 2014, Yekaterinburg, Russia, April 10-12, 2014, Revised Selected Papers. Vol. 439. Berlin : Springer, 2014. P. 163-169.

A method for constructing conceptual maps is presented and applied to three different domains. A conceptual map is a graph, where nodes stand for domain-specific concepts and edges connect associated concepts. The conceptual map reveals and visualizes the logical associations between concepts, which exist in the collection of texts, used to construct the conceptual ...

Added: November 28, 2014

Formal Concept Analysis: 16th International Conference, ICFCA 2021, Strasbourg, France, June 29 – July 2, 2021, Proceedings

Springer, 2021

This book constitutes the proceedings of the 16th International Conference on Formal Concept Analysis, ICFCA 2021, held in Strasbourg, France, in June/July 2021. The 14 full papers and 5 short papers presented in this volume were carefully reviewed and selected from 32 submissions. The book also contains four invited contributions in full paper length. The research part ...

Added: July 10, 2021

Multiple Access Communications

Halmstad : Springer, 2014

This book constitutes the refereed proceedings of the 7th International Workshop on Multiple Access Communications, MACOM 2014, held in Halmstad, Sweden, in August 2014. The 12 full papers presented were carefully reviewed and selected from 22 submissions. They describe the latest advancements in the field of multiple access communications with an emphasis on reliability issues, ...

Added: October 29, 2014

Structure of a system for automatic processing of Russian-language texts

Dubov M., Mirkin B., Шаль А. А., Открытые системы. СУБД 2014

A main tendency in applied informatics currently is the growing activity in developing text analytics methods and software. The software oriented at analysis, summarization and visualization of unstructured texts varies from simple tag builders like Wordle to more intelligent SAS Text Miner, IBM Watson, IBM Content Analytics and the like. Almost all of them are for foreign languages ...

Added: December 10, 2014

Tech Mining for Emerging STI Trends Through Dynamic Term Clustering and Semantic Analysis: The Case of Photonics

Bakhtin P., Saritas O., in: Anticipating Future Innovation Pathways Through Large Data Analysis. Netherlands : Springer, 2016. P. 341-360.

Technology mining (TM) helps to acquire intelligence about the evolution of research and development (R&D), technologies, products, and markets for various STI areas and what is likely to emerge in the future by identifying trends. The present chapter introduces a methodology for the identification of trends through a combination of “thematic clustering” based on the ...

Added: June 20, 2016

APPLICATION OF DATA ENVELOPMENT ANALYSIS IN MANAGEMENT RESEARCH (CASE OF RUSSIAN DOMESTIC ENERGY SECTOR)

Volkova I., / Высшая школа экономики. Series MAN "Management". 2013.

The idea that different firms can be classified into relatively homogeneous groups has been popular for many years, and many typologies have been developed and tested using a variety of classification tools. It has become apparent, however, that most clustering tools are somewhat limited, because they create groups of companies based on similar characteristics, without ...

Added: February 18, 2014

Clustering of medical big data as a toolkit for decision support systems in mathematical cardiology using cloud technologies

Shmid A., Новопашин М. А., Зимина Е. Ю., Системный администратор 2018 Vol. 188-189 No. 07-08 P. 92-96

The mass use of mobile devices for recording electrocardiograms (ECG) leads to quantitative growth in the number of ECGs from many patients that are available for study. New opportunities thus arise for studying the oscillatory processes in the long-term dynamics of the individual state of the cardiovascular system (CVS) of any patient. The article demonstrates new possibilities for long-term continuous monitoring of the CVS state of a large number of patients, which make it possible to identify patterns in CVS dynamics that lead to ...

Added: September 13, 2018

Types of Dropout in Adaptive Open Online Courses

Skryabin M., in: Lecture Notes in Computer Science. Vol. 10254: Digital Education: Out to the World and Back to the Campus. Springer, 2017. P. 273-279.

This study is devoted to different types of students’ behavior before they drop an adaptive course. The Adaptive Python course at the Stepik educational platform was selected as the case for this study. Student behavior was measured by the following variables: number of attempts for the last lesson, last three lessons solving rate, the logarithm ...

Added: March 3, 2019

Clustering and Generalized ANOVA for Symbolic Data Constructed from Open Data

Korenjak–Cerne S., Kejzar N., Batagelj V., in: Advances in Data Sciences: Symbolic, Complex and Network Data. ISTE, Wiley, 2020. P. 209-228.

...

Added: December 10, 2019

Cluster analysis of cardiological data

Зимина Е. Ю., Статистика и Экономика 2018 Vol. 15 No. 2 P. 30-37

The article reviews the cluster analysis of medical data, using cardiac data as an example. Clustering methods are among the most effective and commonly used Data Mining methods applied to large amounts of information (for example, in mathematical economics): the search for signs of similarity between objects in the study of the subject area ...

Added: May 29, 2018