Reproducible and Reliable Distributed Classification of Text Streams

Trofimov A.; Shavkunov M.; Reznick S.; Sokolov N.; Zotov M.; Kuralenok I.; Novikov B.

doi:10.1145/3328905.3332514

P. 264–265.

Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and have proved effective for machine learning, but they do not provide low latency. State-of-the-art distributed stream processing systems, on the other hand, can achieve low latency but do not support the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and fault tolerance mechanisms can affect the correctness of a text classification data flow. We also propose solutions that mitigate the pitfalls we identify.

Language: English

Keywords: automatic classification, data stream

In book

Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems

NY: Association for Computing Machinery (ACM), 2019.

Distributed Classification of Text Streams: Limitations, Challenges, and Solutions

Trofimov A., Sokolov N., Shavkunov M. et al., in: Proceedings of Real-Time Business Intelligence and Analytics. Association for Computing Machinery (ACM), 2019. Ch. 2. P. 1–6.

Text stream classification is an important problem that is difficult to solve at scale. Batch processing systems, widely adopted for text classification tasks, cannot provide low latency. Distributed stream processing systems can offer low latency but do not support the same level of fault tolerance and determinism as batch systems. In this work, ...

Added: November 1, 2019

Divisive-Agglomerative Algorithm and Complexity of Automatic Classification Problems

Rubchinskiy A. / NRU Higher School of Economics. Series WP7 "Mathematical methods for decision making in economics, business and politics". 2015. No. WP7/2015/09.

An algorithm for solving the automatic classification (AC) problem is presented. The AC problem requires finding one or several partitions from a given pattern matrix or dissimilarity/similarity matrix. A three-level scheme for the algorithm is proposed. The output of the procedure ...

Added: October 19, 2017

Cipher, transform, get lost: an anti-transparent system for distance measurement in East Slavic lects

Afanasev I., Journal of Language Relationship 2023 Vol. 21 No. 3-4 P. 159–177

Recent advances in computational historical linguistics have inspired a discussion on newly implemented quantitative methods — mainly, it is about their lack of transparency, and the ways to overcome it. This paper aims to demonstrate the advantages of transparency for such tools. The study compares two types of language distance measurement systems used in classification. ...

Added: May 15, 2024

Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Association for Computational Linguistics, 2023.

These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...

Added: May 15, 2023

The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group

Afanasev I., in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Association for Computational Linguistics, 2023. P. 174–186.

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...

Added: May 15, 2023

Divisive-Agglomerative Classification Algorithm Based on a Minimax Modification of the Frequency Approach

Rubchinskiy A. / Higher School of Economics. Series WP7 "Mathematical methods for decision making in economics, business and politics". 2010. No. 07.

The conventional problem of automatic classification (AC for brevity) is considered. The suggested approach is based on the new combinations of known methods and their modifications. At first, consecutive dichotomies of the initial set are produced, whereby a family of classifications consisting of 2, 3, …, k subsets is constructed, where k is a number ...

Added: March 23, 2013

Using a Probability Distribution over the Set of Classes in the Arabic Dialect Classification Problem

Durandin O., Zolotykh N., Хилал Н. Р. et al., Scientific and Technical Journal of Information Technologies, Mechanics and Optics 2017 No. 1(107) P. 110–116

Subject of Research. We propose an approach to solving a machine learning classification problem that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: classification of Arabic dialects. Method. Each object in the training set is associated with a probability distribution over ...

Added: February 8, 2017

Family of Graph Decompositions and Its Applications to Data Analysis

Rubchinskiy A. / Series WP7 "Mathematical methods for decision making in economics, business and politics". 2016. No. WP7/2016/09.

A new decomposition approach to complex systems analysis is suggested. The conventional approach deals with the construction of a single, “the most correct”, decomposition of the considered system. Meanwhile the suggested approach is oriented to the construction of a family of decompositions, whose properties reveal some important meaningful features of the initial system. The expedience ...

Added: October 20, 2017

23rd International Symposium on Methodologies for Intelligent Systems - Proceedings

Birkhauser/Springer, 2017.

This book constitutes the proceedings of the 23rd International Symposium on Foundations of Intelligent Systems, ISMIS 2017, held in Warsaw, Poland, in June 2017. The 56 regular and 15 short papers presented in this volume were carefully reviewed and selected from 118 submissions. The papers include both theoretical and practical aspects of machine learning, data mining ...

Added: September 18, 2017

A Universal Device for Repacking Data Streams

Aminev D., in: Young Scientists 2008: Proceedings of the International Scientific and Technical School-Conference "Young Scientists for Science, Technology and Professional Education". Vol. 4. Moscow: Energoatomizdat, 2008. P. 23–26.

Existing devices for repacking data streams are studied, along with options for implementing them both as application-specific integrated circuits and on FPGAs. Their limitations, and shortcomings in synchronizing the data flow transformation, are identified. A universal device for repacking data streams is proposed. Its functional diagram and timing diagrams of its operation are shown. ...

Added: July 13, 2013