Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras
Procedures of sequential updating of information are important for “big data streams” processing because they avoid accumulating and storing large data sets. As a model of information accumulation, we study the Bayesian updating procedure for linear experiments. Analysis and gradual transformation of the original processing scheme in order to increase its efficiency lead to certain mathematical structures - information spaces. We show that processing can be simplified by introducing a special intermediate form of information representation. Thanks to the rich algebraic properties of the corresponding information space, it allows unifying and increasing the efficiency of the information updating. It also leads to various parallelization options for inherently sequential Bayesian procedure, which are suited for distributed data processing platforms, such as MapReduce. Besides, we will see how certain formalization of the concept of information and its algebraic properties can arise simply from adopting data processing to big data demands. Approaches and concepts developed in the paper allow to increase efficiency and uniformity of data processing and present a systematic approach to transforming sequential processing into parallel.