Reproducible and Reliable Distributed Classification of Text Streams

Association for Computing Machinery (ACM), 2019.

Novikov B.

In press

Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and have proved effective for machine learning, but they do not provide low latency. State-of-the-art distributed stream processing systems, on the other hand, can achieve low latency but do not offer the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and fault tolerance mechanisms can affect the correctness of a text classification data flow. We also propose solutions that can mitigate the pitfalls we identify.

Machine learning approach for scientific and technical expertise

A. V. Belov, E. A. Egorova, Bulletin D. Serikbayev East Kazakhstan Technical University 2023 No. 4 P. 92-102

When conducting scientific and technical expertise, it is necessary to analyze the texts of reports on scientific research work. The analysis is carried out in order to determine whether the research being conducted belongs to the class of scientific research and development work in the field of IT. This article discusses the tasks of binary ...

Added: March 9, 2024

ML Reproducibility Challenge 2022

[s.n.], 2023

Added: November 2, 2023

Classification of Short Scientific Texts

I. K. Kusakin, Fedorets O. V., A. Y. Romanov, Scientific and Technical Information Processing 2023 Vol. 50 No. 3 P. 176-183

This paper discusses modern approaches to natural language processing and the application of machine learning models to the task of classifying short scientific texts in Russian. This study is devoted to the analysis of methods for vectorization of textual information, selection of a model for scientific paper classification, and training of linguistic model BERT ...

Added: November 4, 2023

The Problem of Text Classification and Differentiating Features

Polyakov I. V., Соколова Т. В., Chepovskiy A. et al., Вестник Новосибирского государственного университета. Серия: Информационные технологии 2015 Vol. 13 No. 2 P. 55-63

This paper presents a text classification method based on mutual information. Word stems are shown to be universal features for the text classification problem ...

Added: October 24, 2015

MLDev: Data Science Experiment Automation and Reproducibility Software

Anton Khritankov, Pershin N., Ukhov N. et al., in : Data Analytics and Management in Data Intensive Domains. 23rd International Conference, DAMDID/RCDL 2021, Moscow, Russia, October 26–29, 2021, Revised Selected Papers. : Springer, 2022. P. 3-18.

Added: September 20, 2022

Using BERT to Classify Short Scientific Texts in Russian

Кусакин И. К., Цурупа А. М., Алмакаев А. В. et al., in : НТИ-2022. Научная информация в современном мире: глобальные вызовы и национальные приоритеты : материалы 10-ой научной конференции с международным участием, посвященной 70-летию ВИНИТИ РАН, Москва, 25–26 октября 2022 года. : M. : ВИНИТИ РАН, 2022. P. 103-109.

This work is devoted to the study of approaches for training BERT-based classifiers of scientific articles to implement the application with the adoption of the best models for use in the infrastructure of the VINITI RAS. For this purpose, the BERT linguistic model was trained on a specialized corpus of scientific texts for subsequent use ...

Added: January 31, 2023

Examining the generalizability of research findings from archival data

Delios A., Clemente E. G., Wu T. et al., Proceedings of the National Academy of Sciences of the United States of America 2022 Vol. 119 No. 30 Article e2120377119

This initiative examined systematically the extent to which a large set of archival research findings generalizes across contexts. We repeated the key analyses for 29 original strategic management effects in the same context (direct reproduction) as well as in 52 novel time periods and geographies; 45% of the reproductions returned results matching the original reports ...

Added: July 19, 2022

FlameStream: Model and Runtime for Distributed Stream Processing

Kuralenok I., Trofimov A., Marshalkin N. et al., in : Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond. : ACM, 2018. P. 8:1-8:2.

Exactly-once semantics without high latency overhead is still hard to achieve within state-of-the-art stream processing systems. We introduce a model that provides exactly-once semantics using a lightweight optimistic approach to obtaining determinism and idempotence. We show its feasibility with a prototype. ...

Added: February 13, 2019

Automatic Recognition of Messages from Virtual Communities of Drug Addicts

Фирсанова В. И., Journal of Applied Linguistics and Lexicography 2021

The paper describes building a binary classifier with Convolutional Neural Network (CNN) using two different types of word vector representations, Bag-of-Words and Word Embeddings. The purpose of the classifier is to recognise messages published in virtual communities of drug-addicted people. This system may find application in healthcare as a tool for automatic identification of addicts’ ...

Added: September 25, 2023

Research of Neural Networks Application Efficiency in Automatic Scientific Articles Classification According to UDC

Romanov A., Lomotin K.E., Kozlova E.S. et al., in : 2016 International Siberian Conference on Control and Communications (SIBCON). Proceedings. : M. : HSE, 2016. Ch. 543fu4t.

This work presents an implementation of automatic classification of scientific articles according to the Universal Decimal Classification. The efficiency of applying neural network technologies to this task is studied, and an optimal neural network structure and parameters are proposed ...

Added: June 11, 2016

Using a Probability Distribution over the Set of Classes in the Arabic Dialect Classification Problem

Durandin O., Zolotykh N., Хилал Н. Р. et al., Научно-технический вестник информационных технологий, механики и оптики 2017 No. 1(107) P. 110-116

Subject of Research. We propose an approach to solving a machine learning classification problem that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: classification of Arabic dialects. Method. Each object in the training set is associated with a probability distribution over ...

Added: February 8, 2017

The Presence of Order-Effect Bias in Moscow Administration

Dmitry Romanov, Kazantsev N., Edgeeva E., in : Business Process Management: Blockchain and Central and Eastern Europe Forum. BPM 2019. Vol. 361.: Springer, 2019. P. 337-341.

This paper studies ‘the order effect’ in decision making based on classification results of 120 000 citizen claims to the Moscow Government. We use machine learning methods and find that with 60% probability the first of two consecutive claims is prioritized. We conclude that this impact must be considered whilst developing artificial intelligence units. ...

Added: October 26, 2020

Profiling the Age of Russian Bloggers

Litvinova T., Sboev A., Panicheva P., in : Artificial Intelligence and Natural Language, 7th International Conference, AINL 2018, St. Petersburg, Russia, October 17–19, 2018, Proceedings. Issue 930.: Switzerland : Springer, 2018. P. 167-177.

The task of predicting demographics of social media users, bloggers and authors of other types of online texts is crucial for marketing, security, etc. However, most of the papers in authorship profiling deal with author gender prediction. In addition, most of the studies are performed in English-language corpora and very little work in the area ...

Added: February 19, 2019

Application of Machine Learning Methods to the Problem of Automatic Classification of Articles by UDC

Romanov A., Ломотин К. Е., Козлова Е. С., Информационные технологии 2017 Vol. 23 No. 6 P. 418-423

The paper deals with the applicability of modern machine learning methods to the problem of automatic generation of UDC codes for scientific articles. Artificial neural networks, logistic regression, and boosting are considered as classifiers. Graph algorithms and a prototype software module to generate UDC codes are designed. ...

Added: July 30, 2017

Development of an FPGA-Based Hardware Module for Text Document Classification

Ломотин К. Е., Romanova I., in : ФЭЭ 2017: Физика, Электроника, Электротехника. Материалы научно-технической конференции. : Сумы : СумДу, 2017. P. 152-152.

Text processing runs into an acute shortage of computing performance. Semantic and statistical document models require complex computations that can take a long time. This problem is an obstacle to adopting the latest developments in text classification. This paper considers the design of a hardware module that classifies incoming documents by given topics. ...

Added: July 31, 2017

Text classification with deep learning neural networks

Voronkov Ilia, Amajd M., Kaimuldenov Z., in : Actual Problems of System and Software Engineering 2017. Proceedings of the 5th International Conference on Actual Problems of System and Software Engineering Supported by Russian Foundation for Basic Research. Project #17-07-20565 Moscow, Russia, November 14-16, 2017, 408 P. Vol. 1989.: Aachen : CEUR Workshop Proceedings, 2017. P. 362-370.

In this paper, we analyze the use of different neural networks for the text classification task. The accuracy of the studied text classifiers can be changed by a small number of previously classified texts. This is important due to the fact that in many applications of text classification a large number of unlabeled texts are easily accessible, while ...

Added: August 16, 2018

TextAnalyst Technology for Automatic Semantic Analysis of Text

Kharlamov A. A., in : Neuroinformatics and Semantic Representations: Theory and Applications. : Cambridge Scholars Publishing, 2020. P. 156-167.

Based on ideas about information processing in the human brain [1], the TextAnalyst technology for automatic semantic text processing has been implemented. It identifies the key concepts of a text and their interrelations, and supports text summarization and semantic comparison (classification) of texts. Products that use this technology's functionality have been built: the personal product TextAnalyst and a library of COM modules, TextAnalyst SDK. ...

Added: December 7, 2021

A Deep Learning Method Study of User Interest Classification

Malafeev A., Nikolaev K., in : Analysis of Images, Social Networks and Texts. 8th International Conference, AIST 2019, Kazan, Russia, July 17–19, 2019, Revised Selected Papers. Communications in Computer and Information Science. Vol. 1086.: Springer, 2020. P. 154-159.

In this paper, a deep learning method study is conducted to solve a new multiclass text classification problem, identifying user interests by text messages. We used an original dataset of almost 90 thousand forum text messages, labeled for ten interests. We experimented with different modern neural network architectures: recurrent and convolutional, as well as simpler ...

Added: November 7, 2019

Pulse of the Nation: Observable Subjective Well-Being in Russia Inferred from Social Network Odnoklassniki

Sergey Smetanin, Mathematics 2022 Vol. 10 No. 16 Article 2947

Policymakers and researchers worldwide are interested in measuring the subjective well-being (SWB) of populations. In recent years, new approaches to measuring SWB have begun to appear, using digital traces as the main source of information, and show potential to overcome the shortcomings of traditional survey-based methods. In this paper, we propose the formal model for ...

Added: August 15, 2022

Emotional Analysis of VKontakte Posts: Classifier or Regressor

Kolmogorova A., Калинин А. А., in : Компьютерная лингвистика и интеллектуальные технологии: по материалам международной конференции «Диалог 2022», выпуск 21. Issue 21.: Изд-во РГГУ, 2022. P. 311-322.

The article summarizes the results of two tasks in the machine learning paradigm: classification by dominant emotion on Russian-language social network posts, and regression on the same data. The experiments are conducted on a data set collected from the VKontakte social network and consisting of 3879 posts ...

Added: March 18, 2024

Text Mining in the Social Sciences

Byzov A., Социология: методология, методы, математическое моделирование 2019 No. 49 P. 131-160

Throughout most of their history, sociologists have sought to study unstructured organic texts: newspaper materials, diaries, memoirs, letters, documents, and, more recently, messages, publications and other texts on various online platforms. This article discusses how modern techniques of text mining can improve classical sociological approaches to the analysis of this type of data. The article ...

Added: December 9, 2019

A Study of Machine Learning Methods for Classifying Scientific Texts in Russian

Кусакин И. К., Федорец О. В., Romanov A., Научно-техническая информация. Серия 2: Информационные процессы и системы 2022 Vol. 12 P. 6-9

This paper discusses modern approaches to natural language processing and the application of artificial intelligence technologies to the task of classifying scientific texts in Russian. The report contains an analysis of implementations of text vectorization methods and a description of experiments with training various classifier models, from classical machine learning algorithms to neural network transformer architectures. ...

Added: January 31, 2023

Near-Zero-Shot Suggestion Mining with a Little Help from WordNet

Alekseev A., Tutubalina E., Kwon S. et al., in : Analysis of Images, Social Networks and Texts. 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers. : Cham : Springer, 2022. P. 23-36.

In this work we explore the constructive side of online reviews: advice, tips, requests, and suggestions that users provide about goods, venues and other items of interest. To reduce training costs and annotation efforts needed to build a classifier for a specific label set, we present and evaluate several entailment-based zero-shot approaches to suggestion classification ...

Added: April 10, 2023

Selection of Pseudo-Annotated Data for Adverse Drug Reaction Classification Across Drug Groups

Alimova I., Tutubalina E., in : Analysis of Images, Social Networks and Texts. 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers. : Cham : Springer, 2022. P. 37-44.

Automatic monitoring of adverse drug events (ADEs) or reactions (ADRs) is currently receiving significant attention from the biomedical community. In recent years, user-generated data on social media has become a valuable resource for this task. Neural models have achieved impressive performance on automatic text classification for ADR detection. Yet, training and evaluation of these methods ...

Added: April 10, 2023