Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and proved their effectiveness for machine learning but do not provide low latency. On the other hand, state-of-the-art distributed stream processing systems are able to achieve low latency but do not support the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and fault tolerance mechanisms can affect the correctness of text classification data flow. We also propose solutions that can mitigate the revealed pitfalls.
Subject of Research.We propose an approach for solving machine learning classification problem that uses the information about the probability distribution on the training data class label set. The algorithm is illustrated on a complex natural language processing task - classification of Arabic dialects. Method. Each object in the training set is associated with a probability distribution over the class label set instead of a particular class label. The proposed approach solves the classification problem taking into account the probability distribution over the class label set to improve the quality of the built classifier. Main Results. The suggested approach is illustrated on the automatic Arabic dialects classification example. Mined from the Twitter social network, the analyzed data contain word-marks and belong to the following six Arabic dialects: Saudi, Levantine, Algerian, Egyptian, Iraq, Jordan, and to the modern standard Arabic (MSA). The paper results demonstrate an increase of the quality of the built classifier achieved by taking into account probability distributions over the set of classes. Experiments carried out show that even relatively naive accounting of the probability distributions improves the precision of the classifier from 44% to 67%. Practical Relevance. Our approach and corresponding algorithm could be effectively used in situations when a manual annotation process performed by experts is connected with significant financial and time resources, but it is possible to create a system of heuristic rules. The implementation of the proposed algorithm enables to decrease significantly the data preparation expenses without substantial losses in the precision of the classification.
Text stream classification is an important problem that is difficult to solve at scale. Batch processing systems, widely adopted for text classification tasks, cannot provide for low latency. Distributed stream processing systems can offer low latency, but do not support the same level of fault tolerance and determinism as the batch systems. In this work, we demonstrate how the distributed stream processing features can affect the results of a typical text classification data flow. Our analysis shows emerged trade-offs between fault tolerance and reproducibility on the one side, and performance on the other side. We outline potential ways to solve the revealed issues and to handle streaming features.
We consider certain spaces of functions on the circle, which naturally appear in harmonic analysis, and superposition operators on these spaces. We study the following question: which functions have the property that each their superposition with a homeomorphism of the circle belongs to a given space? We also study the multidimensional case.
We consider the spaces of functions on the m-dimensional torus, whose Fourier transform is p -summable. We obtain estimates for the norms of the exponential functions deformed by a C1 -smooth phase. The results generalize to the multidimensional case the one-dimensional results obtained by the author earlier in “Quantitative estimates in the Beurling—Helson theorem”, Sbornik: Mathematics, 201:12 (2010), 1811 – 1836.
We consider the spaces of function on the circle whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C1 -smooth phase.