Specific Features of Big Data Processing and the Concept of Information
The data in “big data” sets are, as a rule, huge in volume, distributed among numerous sites, and constantly replenished. As a result, even the simplest analysis of big data faces serious difficulties. Traditional processing requires that all the relevant data be collected in one place and arranged in convenient structures. Only then does the corresponding algorithm process these structures and produce the result of the analysis. In the case of big data, it may be impossible to collect all the relevant data on one computer, and it would be impractical in any case, since one computer could not process them in a reasonable time. An appropriate data analysis algorithm should instead work in parallel on many computers: it should extract some compact intermediate “information” from each set of raw data, gradually combine and update it, and finally use the accumulated information to produce the result. When new pieces of data arrive, it should be able to add them to the accumulated information and then update the result. We discuss the specific features of such a well-arranged intermediate form of information, reveal its natural algebraic properties, and present several examples. We also show that in many important data processing problems the appropriate information space can be equipped with an ordering that reflects the “quality” of the information. It turns out that such an intermediate form of information representation in some sense reflects the very essence of the information contained in the data. This leads us to a completely new, “practical” approach to the notion of information.
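As an illustration of the extract, combine, and update scheme described above, the following minimal Python sketch computes a mean and variance from data held at several sites. It is not taken from the text: the names `Info`, `extract`, `combine`, and `result` are hypothetical, and the choice of (count, sum, sum of squares) as the compact “information” is an assumption made only for this example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Info:
    """Hypothetical compact 'information': enough to recover mean and variance."""
    n: int = 0        # number of observations
    s: float = 0.0    # sum of observations
    ss: float = 0.0   # sum of squared observations


def extract(chunk: Iterable[float]) -> Info:
    """Turn one set of raw data into its compact information (run per site)."""
    xs = list(chunk)
    return Info(len(xs), sum(xs), sum(x * x for x in xs))


def combine(a: Info, b: Info) -> Info:
    """Merge two pieces of information; associative and commutative,
    with Info() as the neutral element."""
    return Info(a.n + b.n, a.s + b.s, a.ss + b.ss)


def result(info: Info) -> Tuple[float, float]:
    """Produce the final answer (mean, population variance) from the information."""
    if info.n == 0:
        raise ValueError("no data accumulated")
    mean = info.s / info.n
    return mean, info.ss / info.n - mean * mean


# Raw data distributed among three sites (made-up numbers).
sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Each site extracts its own information; the pieces are then combined.
accumulated = reduce(combine, map(extract, sites), Info())
mean, var = result(accumulated)

# New data arrives: fold it into the accumulated information and renew the result.
updated = combine(accumulated, extract([7.0]))
new_mean, new_var = result(updated)
```

Because `combine` is associative and commutative, the pieces can be merged in any order and any grouping, which is what allows the extraction to run in parallel and the result to be updated without revisiting the raw data. The sum-of-squares form is used here only for brevity; it can lose precision on large or poorly scaled data, so a practical implementation would likely prefer a numerically stabler representation.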