Distributed data replication and access optimization for LHCb storage system - A Position Paper
This paper presents how machine learning algorithms and statistical methods can be applied to data management in hybrid data storage systems. A hybrid data storage system uses two different storage types. Keeping infrequently used data on cheap, slow storage of the first type and frequently used data on fast, expensive storage of the second type helps to achieve an optimal performance/cost ratio for the system. We use classification algorithms to estimate the probability that the data will be frequently used in the future. Then, using risk analysis, we decide where the data should be stored. We show how to estimate the optimal number of replicas of the data using regression algorithms and a Hidden Markov Model. Based on the probability, the risks, and the optimal number of data replicas, our recommendation system finds the optimal data distribution in the hybrid data storage system. We present the results of implementing our method in the LHCb hybrid data storage.
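As an illustration of the risk-based placement step, the following is a minimal sketch and not the implementation used in the paper. It assumes that a classifier has already produced a probability `p_hot` that a dataset will be frequently used, and it applies a simple expected-cost rule to choose between the two storage types. The function names, the tier labels "fast" and "slow", and the two cost values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    p_hot: float  # assumed classifier output: probability of frequent future use


def choose_tier(p_hot, cost_hot_on_slow=5.0, cost_cold_on_fast=1.0):
    """Pick the storage type with the lower expected cost of a wrong placement.

    The cost values are hypothetical: they stand for the penalty of serving
    frequently used data from slow storage and of occupying fast storage
    with rarely used data.
    """
    # Expected cost if the data goes to slow storage but turns out to be hot.
    risk_slow = p_hot * cost_hot_on_slow
    # Expected cost if the data goes to fast storage but turns out to be cold.
    risk_fast = (1.0 - p_hot) * cost_cold_on_fast
    return "fast" if risk_slow > risk_fast else "slow"


def plan(datasets):
    """Map each dataset name to its recommended storage type."""
    return {d.name: choose_tier(d.p_hot) for d in datasets}
```

With these hypothetical costs, a dataset is sent to fast storage once its estimated probability of frequent use exceeds 1/6; changing the cost ratio moves this threshold.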