Efficient Exact Algorithm for Count Distinct Problem
This paper describes and analyses optimization approaches, which make possible the exact calculation of millions of hierarchical count distinct measures over hundreds of billions data rows. Described approach evolved for several years, in parallel with the growth of tasks from a fast growing internet company, and was finally implemented as a PEAPM (Pipelined Exact Accumulation for Paralleled Measures) algorithm. Current version of an algorithm outputs exact values (not estimates), works in a single thread, in minutes using a general commodity hardware, and requires volume of RAM equal to the doubled size of required measures/
Pattern structures, an extension of FCA to data with complex descriptions, propose an alternative to conceptual scaling (binarization) by giving direct way to knowledge discovery in complex data such as logical formulas, graphs, strings, tuples of numerical intervals, etc. Whereas the approach to classification with pattern structures based on preceding generation of classifiers can lead to double exponent complexity, the combination of lazy evaluation with projection approximations of initial data, randomization and parallelization, results in reduction of algorithmic complexity to low degree polynomial, and thus is feasible for big data.
This book constitutes the refereed proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, CPM 2012, held in Helsinki, Finalnd, in July 2012. The 33 revised full papers presented together with 2 invited talks were carefully reviewed and selected from 60 submissions. The papers address issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties of such structures and to exploit these properties in order to either achieve superior performance for the corresponding computational problems or pinpoint conditions under which searches cannot be performed efficiently. The meeting also deals with problems in computational biology, data compression and data mining, coding, information retrieval, natural language processing, and pattern recognition.
In 2015-2016 the Department of Communication, Media and Design of the National Research University “Higher School of Economics” in collaboration with non-profit organization ROCIT conducted research aimed to construct the Index of Digital Literacy in Russian Regions. This research was the priority and remain unmatched for the momentIn 2015-2016 the Department of Communication, Media and Design of the National Research University “Higher School of Economics” in collaboration with non-profit organization ROCIT conducted research aimed to construct the Index of Digital Literacy in Russian Regions. This research was the priority and remain unmatched for the moment
Companies are increasingly paying close attention to the IP portfolio, which is a key competitive advantage, so patents and patent applications, as well as analysis and identification of future trends, become one of the important and strategic components of a business strategy. We argue that the problems of identifying and predicting trends or entities, as well as the search for technical features, can be solved with the help of easily accessible Big Data technologies, machine learning and predictive analytics, thereby offering an effective plan for development and progress. The purpose of this study is twofold, the first is an identification of technological trends, the second is an identification of application areas and/or that are most promising in terms of technology development and investment. The research was based on methods of clustering, processing of large text files and search queries in patent databases. The suggested approach is considered on the basis of experimental data in the field of moving connected UAVs and passive acoustic ecology control.
The article is dedicated to the analysis of Big Data perspective in jurisprudence. It is proved that Big Data have to be used as the explanatory and predictable tool. The author describes issues concerning Big Data application in legal research. The problems are technical (data access, technical imperfections, data verification) and informative (interpretation of data and correlations). It is concluded that there is the necessity to enhance Big Data investigations taking into account the abovementioned limits.