Автоматическое извлечение текстовых и числовых веб-данных для целей социальных наук
The paper is devoted to the procedures of automatic data extraction from web pages, i.e., web scraping of web data. We consider different types of web data such as digital traces and other numeric and text web data as well as its advantages (the speed of data collection and, as a consequence, the continuous coverage, efficiency, etc.) and limitations (the limited representativeness, difficulties in organizing storage of a large amount of data, deviation from the traditional procedure for setting up a study, etc.) in comparison with traditional methods of data collection. Various tools of web data extraction (API, requests, and selenium) are described to illustrate principles of handling static and dynamic web pages. The paper also gives an overview of the basic minimum of competencies for web scraping: in particular, programming using Python and navigating through the web pages’ code. A detailed illustration is given based on a fragment of the data collection process from a recent relevant Russian study.
Pattern structures, an extension of FCA to data with complex descriptions, propose an alternative to conceptual scaling (binarization) by giving direct way to knowledge discovery in complex data such as logical formulas, graphs, strings, tuples of numerical intervals, etc. Whereas the approach to classification with pattern structures based on preceding generation of classifiers can lead to double exponent complexity, the combination of lazy evaluation with projection approximations of initial data, randomization and parallelization, results in reduction of algorithmic complexity to low degree polynomial, and thus is feasible for big data.
The proceedings of the 11th International Conference on Service-Oriented Computing (ICSOC 2013), held in Berlin, Germany, December 2–5, 2013, contain high-quality research papers that represent the latest results, ideas, and positions in the field of service-oriented computing. Since the first meeting more than ten years ago, ICSOC has grown to become the premier international forum for academics, industry researchers, and practitioners to share, report, and discuss their ground-breaking work. ICSOC 2013 continued along this tradition, in particular focusing on emerging trends at the intersection between service-oriented, cloud computing, and big data.
Full texts of third international conference on data analytics are presented.
The practical relevance of process mining is increasing as more and more event data become available. Process mining techniques aim to discover, monitor and improve real processes by extracting knowledge from event logs. The two most prominent process mining tasks are: (i) process discovery: learning a process model from example behavior recorded in an event log, and (ii) conformance checking: diagnosing and quantifying discrepancies between observed behavior and modeled behavior. The increasing volume of event data provides both opportunities and challenges for process mining. Existing process mining techniques have problems dealing with large event logs referring to many different activities. Therefore, we propose a generic approach to decompose process mining problems. The decomposition approach is generic and can be combined with different existing process discovery and conformance checking techniques. It is possible to split computationally challenging process mining problems into many smaller problems that can be analyzed easily and whose results can be combined into solutions for the original problems.
In 2015-2016 the Department of Communication, Media and Design of the National Research University “Higher School of Economics” in collaboration with non-profit organization ROCIT conducted research aimed to construct the Index of Digital Literacy in Russian Regions. This research was the priority and remain unmatched for the momentIn 2015-2016 the Department of Communication, Media and Design of the National Research University “Higher School of Economics” in collaboration with non-profit organization ROCIT conducted research aimed to construct the Index of Digital Literacy in Russian Regions. This research was the priority and remain unmatched for the moment
Companies are increasingly paying close attention to the IP portfolio, which is a key competitive advantage, so patents and patent applications, as well as analysis and identification of future trends, become one of the important and strategic components of a business strategy. We argue that the problems of identifying and predicting trends or entities, as well as the search for technical features, can be solved with the help of easily accessible Big Data technologies, machine learning and predictive analytics, thereby offering an effective plan for development and progress. The purpose of this study is twofold, the first is an identification of technological trends, the second is an identification of application areas and/or that are most promising in terms of technology development and investment. The research was based on methods of clustering, processing of large text files and search queries in patent databases. The suggested approach is considered on the basis of experimental data in the field of moving connected UAVs and passive acoustic ecology control.
The article is dedicated to the analysis of Big Data perspective in jurisprudence. It is proved that Big Data have to be used as the explanatory and predictable tool. The author describes issues concerning Big Data application in legal research. The problems are technical (data access, technical imperfections, data verification) and informative (interpretation of data and correlations). It is concluded that there is the necessity to enhance Big Data investigations taking into account the abovementioned limits.