Proceedings of the Ivannikov ISPRAS Open Conference, 2–3 December 2020, Moscow
These days, real-time analytics is one of the most frequently used notions in the world of databases. Broadly, the term means very fast analytics over very fresh data. It usually appears together with two other popular terms: hybrid transactional/analytical processing (HTAP) and in-memory data processing. The reason is that the simplest way to provide fresh operational data for analysis is to combine transactional and analytical processing in one system, and the most effective way to provide fast transactional and analytical processing is to store the entire database in memory. So, on the one hand, these three terms are related; on the other hand, each of them stands on its own. In this paper, we provide an overview of several in-memory data management systems that are not HTAP systems. Some of them are purely transactional, some are purely analytical, and some support real-time analytics. We then survey nine in-memory HTAP DBMSs, some of which do not support real-time analytics. Existing real-time in-memory HTAP DBMSs have very diverse and interesting architectures, although they share a number of common approaches: multiversion concurrency control, multicore parallelization, advanced query optimization, just-in-time compilation, etc. Additionally, we examine whether these systems use non-volatile memory (NVM) and, if so, in what manner. We conclude that the emergence of a new generation of NVM will greatly stimulate its use in in-memory HTAP systems.
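To make the first of the common approaches listed above concrete, the following is a minimal sketch of multiversion concurrency control: each write appends a new timestamped version, and a reader with a snapshot timestamp sees the newest version committed at or before it. This is an illustration only, not the design of any surveyed system; all class and method names are hypothetical.

```python
import itertools

class MVCCStore:
    """Toy multiversion store (hypothetical, illustration only)."""

    def __init__(self):
        self._versions = {}               # key -> list of (commit_ts, value)
        self._clock = itertools.count(1)  # monotonically increasing timestamps

    def begin(self):
        """Start a reader: take a snapshot timestamp."""
        return next(self._clock)

    def write(self, key, value):
        """Commit a new version (single-statement transaction for brevity)."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, snapshot_ts):
        """Return the newest version visible to the given snapshot."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("x", 1)
snap = store.begin()           # snapshot taken after the first write
store.write("x", 2)            # newer version, invisible to `snap`
assert store.read("x", snap) == 1
assert store.read("x", store.begin()) == 2
```

The key property shown is that readers never block writers: the old version stays reachable for in-flight snapshots, which is what lets analytical queries run over a consistent view while transactions continue to commit.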
Modern supervised algorithms assume that the dataset used for training has the same distribution as the data to be processed. However, real data changes continuously. This leads to gradual degradation of supervised machine learning algorithms in production systems and increases maintenance costs. To solve this problem, we focus on domain adaptation of machine learning algorithms in a lifelong manner. We assume that real unlabelled data arrives continuously. For this setting, we propose a method for detecting changes in data distributions as well as updating supervised algorithms. The idea behind the method is to process a portion of the data and create a new labelled dataset for training a supervised model. The trained model becomes part of an ensemble used for selecting a strategy to deal with new examples: assign the label automatically using co-training, or manually with the aid of active learning. This method is independent of the specific architecture of the model and can be used with any modern supervised algorithm, including artificial neural networks. Our research also confirms two findings. First, adding a small portion of data with reliable labels to a self-labelled dataset improves the model's performance, even if this amount is too small to build a model from scratch. Second, accumulating domain knowledge by continuously adding newly trained models to the ensemble used for labelling reduces the amount of labelled data required while maintaining the high performance of the adapted model.
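One way to read the strategy-selection step described in this abstract: an ensemble of previously trained models votes on each incoming example; when agreement is high enough the label is assigned automatically, otherwise the example is routed to a human annotator (active learning). The sketch below illustrates only this routing decision; the agreement threshold and the model interface are hypothetical, not taken from the paper.

```python
from collections import Counter

def select_strategy(models, example, agreement_threshold=0.8):
    """Return ('auto', label) when the ensemble agrees strongly enough,
    otherwise ('manual', None) to request a human label."""
    votes = Counter(model(example) for model in models)
    label, count = votes.most_common(1)[0]
    if count / len(models) >= agreement_threshold:
        return "auto", label          # confident: self-label automatically
    return "manual", None             # uncertain: query an annotator

# Toy ensemble: each "model" is just a callable returning a label.
ensemble = [lambda x: "pos", lambda x: "pos", lambda x: "pos",
            lambda x: "neg", lambda x: "pos"]

print(select_strategy(ensemble, "some example"))  # ('auto', 'pos')
```

Examples labelled automatically would extend the self-labelled dataset for training the next model, while manually labelled ones supply the small portion of reliable labels that, per the second finding, improves the adapted model's performance.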