Towards Machine Learning in Distributed Array DBMS: Networking Considerations
Computer networks are the veins of modern distributed systems. Array DBMS (Database Management Systems) operate on big data that is naturally modeled as arrays, e.g. Earth remote sensing data and data from numerical simulations. The volume of this data forces array DBMS to be distributed and to rely heavily on computer networks. The R&D area of array DBMS is relatively young, and machine learning is only beginning to make its way into array DBMS; hence, existing work in this area is sparse. This paper considers distributed large matrix multiplication (LMM) executed directly inside array DBMS. LMM is the core operation of many machine learning techniques on big data, yet LMM directly inside array DBMS has been neither well studied nor well optimized. We present novel LMM approaches for array DBMS and analyze the intricacies of LMM in array DBMS, including execution plan construction and network utilization. We carry out a performance evaluation in the Microsoft Azure Cloud on a network cluster of virtual machines, report insights derived from the experiments, and present our vision of future machine learning R&D directions based on LMM directly inside array DBMS.