### Article

## Using a Probability Distribution over the Set of Classes in the Arabic Dialect Classification Task

**Subject of Research.** We propose an approach to machine learning classification problems that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: classification of Arabic dialects. **Method.** Each object in the training set is associated with a probability distribution over the set of class labels instead of a single class label. The proposed approach takes this distribution into account when solving the classification problem in order to improve the quality of the resulting classifier. **Main Results.** The approach is illustrated on automatic classification of Arabic dialects. The analyzed data were collected from the Twitter social network, contain word-marks, and belong to six Arabic dialects (Saudi, Levantine, Algerian, Egyptian, Iraqi, Jordanian) and to Modern Standard Arabic (MSA). The results show that taking probability distributions over the set of classes into account improves the quality of the resulting classifier. Experiments show that even a relatively naive use of the probability distributions improves the precision of the classifier from 44% to 67%. **Practical Relevance.** The approach and the corresponding algorithm can be used effectively when manual annotation by experts requires significant financial and time resources but a system of heuristic rules can be created. Implementing the proposed algorithm makes it possible to significantly reduce data preparation costs without a substantial loss in classification precision.
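The abstract does not specify the model or the training procedure, so the following is only a minimal sketch of one naive way to train on label distributions: softmax regression fitted by gradient descent on the cross-entropy between each object's label distribution and the predicted distribution. The function names, the toy data, and the hyperparameters (`lr`, `epochs`) are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_soft_label_classifier(X, P, lr=0.5, epochs=500):
    """Fit softmax regression to soft labels (hypothetical helper).

    X: (n, d) feature matrix.
    P: (n, k) matrix; row i is a probability distribution over the k classes
       for object i (it replaces a single hard label).
    """
    n, d = X.shape
    k = P.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        Q = softmax(X @ W + b)      # predicted class distributions
        G = (Q - P) / n             # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the index of the most probable class for each row of X."""
    return softmax(X @ W + b).argmax(axis=1)


# Toy illustration with made-up data: three objects, three classes.
X_toy = np.eye(3)
P_toy = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])
W_toy, b_toy = fit_soft_label_classifier(X_toy, P_toy)
labels_toy = predict(X_toy, W_toy, b_toy)
```

With hard labels, each row of `P` would be one-hot and the same code reduces to ordinary softmax regression. In a setting like the one described, the rows of `P` could come from heuristic rules rather than expert annotation, but how the distributions are constructed is not stated in the abstract.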

The extremely important role of information in the modern world has led to its recognition as a resource in its own right, as important and necessary as energy, finance, and raw materials. Society's need to collect, store, and process information as a commodity has created a new range of services: the information technology market. Volumes of information are growing rapidly; data at this scale are called "Big Data" and have become a subject of analysis. To solve management problems based on the analysis of such data, their heterogeneity and high degree of variation must be taken into account. Systematizing and grouping the information obtained therefore makes it possible to improve the quality of decisions in planning and production management tasks. When choosing a grouping method, the high dimensionality of the data, which affects processing time, should be taken into account in addition to the type of task at hand. This work presents the results of research into data grouping methods for a certain range of practical problems in large-scale data processing, as well as the results of solving various practical management problems using these methods.

The collection includes articles by linguists from Russia, Uzbekistan, Algeria, and Jordan. The papers discuss pressing issues in the methodology of teaching the Arabic language, as well as theoretical questions of traditional Arabic grammatical theory and literature. A comparative analysis of adopted Arabic vocabulary is presented. The collection may be useful for Arabists, teachers, graduate students, and all those interested in the Arabic language, Arabic literature, and methods of teaching Arabic.

This project describes an application for creating ubiquitous hypertext on the Web, which enhances the user experience by allowing clipping and sharing the information. The goal of the application is to annotate text and link it to relevant content, especially from the Linked Open Data community and from the Ontos knowledge base. The paper describes two use cases and highlights the main functionality of the application.

*The paper deals with the applicability of modern machine learning methods to the problem of automatic generation of UDC for scientific articles. As the classifiers, such models as artificial neural networks, logistic regression and boosting are considered. Graph algorithms and a prototype software module to generate UDC are designed.*

Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and have proved effective for machine learning, but they do not provide low latency. On the other hand, state-of-the-art distributed stream processing systems can achieve low latency but do not support the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and fault tolerance mechanisms can affect the correctness of a text classification data flow. We also propose solutions that can mitigate the pitfalls identified.

The paper discusses the creation of analyst workstation software that supports mining large amounts of statistical data on science, education, and innovation. A hybrid approach is presented that integrates classical methods of mathematical correlation analysis, pattern analysis, and time series analysis, together with interpretation of the results. Particular attention is paid to business processes for identifying trends and changes in indicators, atypical indicator dynamics, and the definition of "Best Performance" indicator vectors.

Manually annotated corpora are very important and very expensive resources: the annotation process requires a lot of time and skill. In the OpenCorpora project, we are trying to involve native speakers with no special linguistic knowledge in the annotation work. In this paper, we describe how we organize our processes to maintain high annotation quality and report our preliminary results.

We consider certain spaces of functions on the circle, which naturally appear in harmonic analysis, and superposition operators on these spaces. We study the following question: which functions have the property that each of their superpositions with a homeomorphism of the circle belongs to a given space? We also study the multidimensional case.

We consider the spaces of functions on the m-dimensional torus whose Fourier transform is p-summable. We obtain estimates for the norms of the exponential functions deformed by a C^1-smooth phase. The results generalize to the multidimensional case the one-dimensional results obtained by the author earlier in "Quantitative estimates in the Beurling–Helson theorem", Sbornik: Mathematics, 201:12 (2010), 1811–1836.

We consider the spaces of functions on the circle whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C^1-smooth phase.

This proceedings publication is a compilation of selected contributions from the "Third International Conference on the Dynamics of Information Systems", which took place at the University of Florida, Gainesville, February 16–18, 2011. The purpose of this conference was to bring together scientists and engineers from industry, government, and academia to exchange new discoveries and results in a broad range of topics relevant to the theory and practice of the dynamics of information systems. Dynamics of Information Systems: Mathematical Foundation presents state-of-the-art research and is intended for graduate students and researchers interested in some of the most recent discoveries in information theory and dynamical systems. Scientists in other disciplines may also benefit from applying the new developments to their own areas of study.