Information Spaces for Big Data Processing: Unification and Parallelization of Sequential Information Accumulation Procedures
In large-scale research, data are usually collected at many sites, reach huge volumes, and are continually supplemented by newly generated data. Since it is often impossible to gather all the relevant data on a single computer, much attention is paid to algorithms that accumulate information sequentially or in parallel and do not need to store all the original data. As an example of information accumulation, the Bayesian updating procedure for linear experiments is analyzed. The corresponding information spaces are defined and the relations between them are studied. It is shown that processing can be unified and simplified by introducing a special canonical form of information representation and transforming all the data and the original prior information into this form. Owing to the rich algebraic properties of the canonical information space, the sequential Bayesian procedure admits various parallelization options that are well suited to distributed data processing platforms such as Hadoop MapReduce. This opens up the possibility of flexible and efficient scaling of information accumulation in distributed data processing systems.
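The idea of accumulating information in a canonical form can be illustrated with a minimal sketch. The abstract does not spell out the canonical form, so the sketch assumes the standard information form of a Gaussian linear model: a scalar measurement y = a·x + noise with noise variance s contributes the pair (T, v) = (a aᵀ/s, a y/s), a Gaussian prior with independent components is mapped to a pair of the same kind, and pieces of information are combined by addition. All function names (`to_canonical`, `prior_to_canonical`, `combine`, `posterior_mean`) and the data are hypothetical, and exact rational arithmetic is used only so that the sequential and partitioned results can be compared exactly.

```python
from fractions import Fraction
from functools import reduce


def to_canonical(a, y, var):
    """Map one scalar measurement y = a.x + noise (noise variance var) to (T, v)."""
    n = len(a)
    w = Fraction(1) / Fraction(var)
    T = [[w * a[i] * a[j] for j in range(n)] for i in range(n)]
    v = [w * a[i] * y for i in range(n)]
    return T, v


def prior_to_canonical(mean, variances):
    """Map a Gaussian prior with diagonal covariance (an assumption) to (T, v)."""
    n = len(mean)
    T = [[Fraction(1) / Fraction(variances[i]) if i == j else Fraction(0)
          for j in range(n)] for i in range(n)]
    v = [T[i][i] * mean[i] for i in range(n)]
    return T, v


def combine(p, q):
    """Combine two pieces of canonical information by componentwise addition."""
    (T1, v1), (T2, v2) = p, q
    T = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(T1, T2)]
    v = [x + y for x, y in zip(v1, v2)]
    return T, v


def posterior_mean(info):
    """Solve T x = v by Gauss-Jordan elimination (T assumed nonsingular)."""
    T, v = info
    n = len(v)
    M = [list(row) + [b] for row, b in zip(T, v)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * z for x, z in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


# Hypothetical data: prior N(0, I) and four unit-variance measurements (a, y, var).
prior = prior_to_canonical([0, 0], [1, 1])
measurements = [([1, 0], 2, 1), ([0, 1], 3, 1), ([1, 1], 5, 1), ([1, -1], -1, 1)]
items = [to_canonical(a, y, var) for a, y, var in measurements]  # "map" step

# Sequential accumulation: fold every item into the prior one at a time.
sequential = reduce(combine, items, prior)

# Partitioned accumulation: reduce each partition separately, then merge.
part_a = reduce(combine, items[:2])
part_b = reduce(combine, items[2:])
parallel = combine(prior, combine(part_a, part_b))

print(posterior_mean(sequential))
print(sequential == parallel)
```

Because `combine` is associative and commutative, the partial sums can be formed on separate workers in any grouping and merged afterwards, which is the property a MapReduce-style combiner and reducer rely on. Only the accumulated pair (T, v) has to be kept, not the individual measurements.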