Application of NLP Algorithms: Automatic Text Classifier Tool
This research is dedicated to the design of a decision support system for the categorization of scientific literature. The purpose of this work is to investigate how machine learning algorithms can be applied to automate manual text categorization. The following stages are considered: preprocessing of raw data, word embedding, model selection, classification model, and software design. At the first stage, a training set of 200,000 Russian texts was formed in collaboration with VINITI RAS. At the second stage, the word embedding model was chosen and justified: a Word2Vec vector representation of dimensionality 1500, obtained from the text matrix by “sum” convolution. At the third stage, the quality of the classifiers was estimated, and logistic regression, which achieved the highest F1 score (0.94), was selected. At the final stage, the ATC (Automatic Text Classifier) application, which incorporates the results obtained at the previous stages, was developed. The overall structure of the application is described. It consists of compact program modules that can be replaced or adapted to the incoming text in order to achieve the highest classification score.
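As an illustration of the second stage, the following minimal sketch shows one plausible reading of the “sum” convolution: the word vectors that make up the text matrix are added element-wise to give a single fixed-length text vector. This reading is an assumption, since the abstract does not define the operation precisely. The function name `sum_pool`, the toy embeddings, and the 3-dimensional vectors are hypothetical and chosen only for readability; the representation described above has dimensionality 1500 and would come from a trained Word2Vec model.

```python
from typing import Dict, List, Sequence


def sum_pool(tokens: Sequence[str],
             embeddings: Dict[str, List[float]],
             dim: int) -> List[float]:
    """Collapse a text into one vector by summing its word vectors.

    Tokens missing from the embedding table are skipped (an assumption;
    the source does not say how out-of-vocabulary words are handled).
    """
    doc = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # out-of-vocabulary token
        if len(vec) != dim:
            raise ValueError(f"vector for {tok!r} has length {len(vec)}, expected {dim}")
        for i, value in enumerate(vec):
            doc[i] += value
    return doc


# Hypothetical toy embeddings standing in for a trained Word2Vec model.
TOY_EMBEDDINGS = {
    "neural": [1.0, 0.0, 2.0],
    "network": [0.5, 1.0, 0.0],
    "training": [0.0, 2.0, 1.0],
}

doc_vector = sum_pool(["neural", "network", "unknown", "training"],
                      TOY_EMBEDDINGS, dim=3)
print(doc_vector)  # [1.5, 3.0, 3.0]
```

A vector produced this way has the same length for every text regardless of the number of words, which is what allows it to be passed directly to a classifier such as logistic regression.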