Big Data Normalization for Massively Parallel Processing Databases

Golov N., Rönnbäck L.

doi:10.1016/j.csi.2017.01.009

Computer Standards and Interfaces. 2017. Vol. 54. No. P2. P. 86–93.

High-performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing (MPP) databases. Furthermore, there is a contradiction between ease of extending the data model and ease of analysis. The modern 'Data Lake' approach promises extreme ease of adding new data to a data model; however, it is prone to eventually becoming a Data Swamp: an unstructured, ungoverned, and out-of-control Data Lake where, due to a lack of process, standards, and governance, data is hard to find, hard to use, and consumed out of context. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. This technique is almost as convenient for expanding the data model as a Data Lake, while being inherently protected from turning into a Data Swamp. A case study of how this approach is used for a Data Warehouse at Avito over a three-year period, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented. This paper extends the theses presented at the 34th International Conference on Conceptual Modeling (ER 2015) (Golov and Rönnbäck 2015) [1]; it is complemented with numerical results on key operating areas of a highly normalized big data warehouse, collected over several (1-3) years of commercial operation. The limitations imposed by using a single MPP database cluster are also described, and a cluster fragmentation approach is proposed.

Priority areas: IT and mathematics; business informatics

Language: English

Keywords: analytics; big data; MPP; Database; Normalization; Ad-hoc; Querying; Performance; Modeling; Data Lake

Digital Society: A Theoretical Model and Russian Reality

Смирнов А. В., Мониторинг общественного мнения: Экономические и социальные перемены 2021 No. 1 P. 129–153

The article considers a theoretical model of digital society based on four concepts: super-connectivity, platformisation, datafication, and algorithmic governance. The model describes how the digitalisation of society deepens: from the transfer of individual practices and social interactions to a new social order based on big data. Analysis of panel data from the 2003–2018 longitudinal survey ...

Added: March 18, 2026

Forecasting Migration Processes Using Digital Demography Methods

Смирнов А. В., Экономика региона 2022 Vol. 18 No. 1 P. 133–145

The nature and intensity of migration processes are constantly changing. Demographic statistics are not suitable for obtaining up-to-date information and making timely decisions in the field of demographic and social policy. Thus, digital demography is becoming increasingly important, as this area of population research uses new methods and data sources resulting from the Internet expansion ...

Added: March 18, 2026

Directions of Scientific Cooperation and Features of Russia's Cultural Exchange with the Countries of the Middle East and the Mediterranean, Based on Contemporary Analytics

Ли О. В., Пространство науки 2024 Vol. 1 No. 4 P. 736–750

Russia has to participate in the struggle for cultural influence, which is escalating all over the world, and promote its values and ideas. In this context, the relevance and importance of cultural and humanitarian cooperation with the countries of the Middle East and the Mediterranean is increasing. The article identifies key problems and makes recommendations ...

Added: March 12, 2026

Methodology and Tasks of Applied Analytics

Isakov V., Ильин Н. И., in: Applied Analytics: A Collective Monograph. МГУ, МАКС Пресс, 2025. P. 26–42.

The article discusses the concept and types of analytics and the concept of analytics methodology. The system of analytical methods is revealed. The concept of the methodological profile of analytical research is introduced. The principles of modern analytical research are considered. ...

Added: March 3, 2026

Applied Analytics: A Collective Monograph

Бахтизин А. Р., Ильин Н. И., Isakov V., МГУ, МАКС Пресс, 2025.

The monograph systematically sets out the fundamental principles, basic methods, and tools of applied analytics. Particular attention is paid to practical recommendations for analysts in specific areas: global processes, macroeconomics, science and technology, industry, fuel and energy, construction, agriculture, national security, socio-political system, national projects and programs, demography, and much more. The monograph is intended ...

Added: March 2, 2026

Organizational-Activity Games as a Technology of Analytics

Isakov V., in: The Best Analysts of Russia: Our Contemporaries. Issue 2. Moscow: Красанд, 2025. P. 233–245.

The article substantiates the possibility of using organizational and activity games as one of the technologies for solving analytical problems. The stages and principles of the methodology of conducting ML games are shown. The author's personal experience of participating in online games is highlighted. ...

Added: March 1, 2026

The Best Analysts of Russia: Our Contemporaries. Issue 2.

Isakov V., Karaganov S. A., Naumkin V., Moscow: Красанд, 2025.

The collection is the result of extensive and painstaking work by a group of scientists representing the Analytica Association. The second issue of the collection includes articles by leading Russian philosophers, economists, historians and lawyers. Special attention is paid to civilizational analytics. The publication presents previously unpublished materials by the outstanding sociologist Alexander Zinoviev. Sergey ...

Added: March 1, 2026

Improving guest satisfaction by identifying hotel service micro-elements failures through Deep Learning of online reviews

Kazakov S., Cuesta-Valiño P., Butkovskaya V. et al., Cuadernos de Gestion 2025 Vol. 25 No. 1 P. 71–88

This study provides an in-depth examination of often-overlooked hotel service micro-elements within the broader spectrum of hospitality services, with the aim of improving service delivery and enhancing guest satisfaction. To achieve this, we develop a methodological framework that integrates: (a) VADER text-based sentiment analysis, (b) a robust logistic regression procedure to identify the specific hotel ...

Added: February 28, 2026

Legal Analytics in Public Administration

Isakov V., Академический юридический журнал 2024 Vol. 25 No. 3 P. 500–516

The article reveals the role and place of legal analytics in public administration. According to the author, it is based on the analysis of various legal situations. The structural elements of the legal analytical situation and the variants of its dynamics are analyzed. The types of legal analytics are considered, among which information analytics, data ...

Added: February 27, 2026

Data Analytics for Predicting Situational Developments in Smart Cities: Assessing User Perceptions

Kharlamov A. A., Pilgun M., in: Special Issue Sensing Technology for Smart Cities: Data, Analytics, and Visualizations. Vol. 24. Issue 15. [s.n.], 2024.

The analysis of large volumes of data collected from heterogeneous sources is increasingly important for the development of megacities, the advancement of smart city technologies, and ensuring a high quality of life for citizens. This study aimed to develop algorithms for analyzing and interpreting social media data to assess citizens’ opinions in real time and ...

Added: February 22, 2026

Special Issue Sensing Technology for Smart Cities: Data, Analytics, and Visualizations

[s.n.], 2024.

Nowadays a huge portion of the population lives in urban areas, and projections indicate that most cities are going to be confronted with a growing urban population in the next few years. This undoubtedly poses new challenges that must be addressed by city councils and stakeholders to guarantee citizens’ high quality of life. Mobility, pollution, climate ...

Added: February 15, 2026

ALGORITHMIZATION OF LAW ENFORCEMENT MANAGEMENT PROCESSES USING ARTIFICIAL INTELLIGENCE

Barchukov, V., Relacoes Internacionais no Mundo Atual 2024 Vol. 4 No. 46 P. 113–132

Objective: Despite the opportunities that are opening up due to the development of information support systems and artificial intelligence in law enforcement, unfortunately, the Russian Federation has not yet fully formed a scientifically based legal and organizational framework for their integrated and practical application in activities of law enforcement agencies. The article aims to assess ...

Added: January 20, 2026

Iterative Ricci-Foster Curvature Flow with GMM-Based Edge Pruning: A Novel Approach to Community Detection

Sorokin K., Beketov M., Онучин А. et al., / arxiv.org. Series cs.SI "Social and Information Networks". 2025.

Community detection in complex networks is a fundamental problem, open to new approaches in various scientific settings. We introduce a novel community detection method, based on Ricci flow on graphs. Our technique iteratively updates edge weights (their metric lengths) according to their (combinatorial) Foster version of Ricci curvature computed from effective resistance distance between the ...

Added: January 15, 2026

Artificial Intelligence for Urban Planning and Building Smart Cities

Demekhina A., Milshina Y., in: Artificial Intelligence Enabled Real Time Environmental Monitoring. Springer, 2026. P. 253–281.

Added: January 13, 2026

Implementing Transport Coding in OMNeT++ for Message Delay Reduction

Petrovanov I., Sergeev A., / Series Computer Science "arxiv.org". 2025. No. 2512.18332.

Transport coding reduces message delay in packet-switched networks by introducing controlled redundancy at the transport layer: $k$ original packets are encoded into $n$ coded packets, and the message is reconstructed after the first $k$ successful deliveries, effectively shifting latency from the maximum packet delay to the $k$-th order statistic. We present a concise, reproducible discrete-event implementation of transport coding in OMNeT++, including ...

Added: December 24, 2025

Classifications and Classifiers in Science and Analytics

Isakov V., Юридическая техника 2024 No. 18 P. 17–31

This consultation is devoted to two closely interrelated issues: the first part examines the logical and methodological foundations of the classification approach in analytics; the second part applies this approach to analytics itself as an object of classification and considers its kinds and types. ...

Added: December 15, 2025

Legal Analytics: Intelligent Technologies of Legal Practice

Isakov V., Nizhny Novgorod: Нижегородская академия МВД России, 2025.

The country, society, legal science and practice are at a turning point. Information technologies are powerfully entering life, which are changing social relations, creating new subjects and relationships that did not exist before. A distinctive feature of these technologies is the need to interact with artificial intelligence. It is wrong to stand on the point of ...

Added: December 15, 2025

Digital Representation of Youth’ Agency in Culture: A Database of Projects

Sorokin P. S., Novikova V., Goshin M. E., Frontiers in Psychology 2025 Vol. 16 Article 1716164

Agency is a multidisciplinary concept denoting individual ability to initiate and carry out actions that transform one’s own life and social environment. Traditionally, in psychology, agency is viewed as an internal characteristic—a constellation of motives, attitudes, and cognitive schemas that drive human action. Studies of this sort focus on the desire to be active, the ...

Added: December 11, 2025

Prospects for Integrating New Digital Technologies into Modern Education to Improve Its Effectiveness

Бояров Е. Н., Социальная компетентность 2025 Vol. 10 No. 2 P. 42–51

The article addresses the problem of integrating new digital technologies into modern education to enhance its effectiveness and quality. The purpose of the study is to summarize theoretical and practical approaches to the use of digital tools in educational environments and to identify key directions and barriers to the digital transformation of education. The research ...

Added: December 9, 2025

Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset

Меньшиков И. А., Бернадотт А. К., Elvimov N. S., / Series arXiv "Statistical mechanics". 2025.

Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image ...

Added: December 1, 2025

Determining the boundary of dynamical chaos in the generalized Chirikov map via machine learning

Чернышов Д. П., Satanin A., Shchur L., / Series arXiv "math". 2025.

We investigate the boundary separating regular and chaotic dynamics in the generalized Chirikov map, an extension of the standard map with phase-shifted secondary kicks. Lyapunov maps were computed across the parameter space (K,K(α, τ)) and used to train a convolutional neural network (ResNet18) for binary classification of dynamical regimes. The model reproduces the known critical ...

Added: November 21, 2025

An Efficient Stock Market Trading Algorithm: A Retrospective Analysis Based on S&P-500 Data

Rubchinskiy A., Chubarova D., / Series WP7 "Математические методы анализа решений в экономике, бизнесе и политике". 2025. No. WP7/2025/01.

The article examines one of the most famous examples of socio-economic systems, characterized by significant uncertainty – the S&P-500 stock market, where shares of 500 largest US companies are traded. No assumptions are made about the probabilistic characteristics of the stock market. A flexible algorithm for daily trading has been developed, based on both known fixed data ...

Added: November 9, 2025

The World Is on the Threshold of the Era of Technological Singularity: How the Trends of Basic Global Processes and the Evolution of Humanity Will Change

Akaev A., Ильин И. В., Korotayev A., Вестник Российской академии наук 2025 No. 9 P. 3–15

The article examines the likelihood of creating artificial intelligence (AI) at the human level (“human intelligence level”, AGI) by 2027-2029 and the onset of the era of technological singularity, when a fundamental change in the mechanism of human evolution will occur. It is noted that this probability is close to one, since these dates surprisingly ...

Added: October 28, 2025

Diffusion on language model embeddings for protein sequence generation

Meshchaninov V., Strashnov P., Shevtsov A. et al., / Cornell University. Series CoRR, arXiv:2403.03726 "Computing Research Repository". 2025.

Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived ...

Added: October 5, 2025