SQL query optimization for highly normalized Big Data

Golov N., Ronnback L.

Business Informatics. 2015. No. 3.

This paper describes an approach for fast ad-hoc analysis of Big Data inside a relational data model. The approach strives to achieve maximal utilization of highly normalized temporary tables through the merge join algorithm. It is designed for the Anchor modeling technique, which requires a very high level of table normalization. Anchor modeling is a novel data warehouse modeling technique, designed for classical databases and adapted by the authors of the article for a Big Data environment and an MPP database. Anchor modeling provides flexibility and high speed of data loading, while the presented approach adds support for fast ad-hoc analysis of Big Data sets (tens of terabytes). Different approaches to query plan optimization are described and estimated for row-based and column-based databases. Theoretical estimates and the results of real-data experiments carried out in a column-based MPP environment (HP Vertica) are presented and compared. The results show that the approach is particularly favorable when available RAM is scarce, so that executing ad-hoc queries switches from pure in-memory processing to spilling over to hard disk. Scaling is also investigated by running the same analysis on different numbers of nodes in the MPP cluster. Configurations of 5, 10 and 12 nodes were tested, using clickstream data from Avito, the largest classifieds site in Russia.

Research target: Computer Science

Priority areas: business informatics

Language: English

Keywords: databases, modeling, performance, analytics, big data, normalization, querying

WebArrayDB: A Geospatial Array DBMS in Your Web Browser

Rodriges Zalipynis R. A., Terlych N., PROCEEDINGS OF THE VLDB ENDOWMENT 2022 Vol. 15 No. 12 P. 3622–3625

Geospatial array DBMSs operate on georeferenced N-d arrays. They provide storage engines, query parsers, and processing capabilities as their core functionality. Traditionally, those have been too heavy for a Web browser to support. Hence, Web Applications, mostly Geographic Information Systems (GISs), run array management on their server back-ends that return small portions of the results ...

Added: August 30, 2022

Синтез информационной системы управления подсистемами технического обеспечения интеллектуальных зданий

Vikentyeva O., Deryabin A. I., Shestakova L. V. et al., Вестник Московского государственного строительного университета 2017 Т. 12 № 10 С. 1191–1201

Subject: smart house maintenance requires taking into account a number of factors - resource conservation, mitigating working expenditures, safety enhancement, ensuring comfort of leisure and operation. Automation of such engineering systems networks as illumination, climate control, security and communication, may be achieved through utilization of contemporary technologies (e.g. IoT – Internet of Things). However, storing ...

Added: November 21, 2017

Труды ХVIII международной конференции DAMDID / RSDL’2016, 11-14 октября 2016, Ершово, Московская область, Россия

НИЯУ МИФИ, 2016.

In 2016 the International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2016) was held on October 11 – 14 in the Holiday Center, Ershovo (Moscow region). By tradition the “Data Analytics and Management in Data Intensive Domains” conference (DAMDID) is planned as a multidisciplinary forum of researchers and practitioners from various domains of science and research, promoting ...

Added: January 26, 2017

Сжатие данных в хранилище больших графов

Polyakov I. V., Chepovskiy A., Chepovskiy A., Фундаментальная и прикладная математика 2016 Т. 21 № 4 С. 125–132

The article considers data compression methods for storing large graphs. Algorithms are proposed for preprocessing graphs of a special structure in order to increase data storage density and improve the efficiency of basic graph operations. ...

Added: December 23, 2017

Intelligent Information and Database Systems: 13th Asian Conference, ACIIDS 2021, Phuket, Thailand, April 7–10, 2021, Proceedings

Springer, 2021.

This book constitutes the refereed proceedings of the 13th Asian Conference on Intelligent Information and Database Systems, ACIIDS 2021, held in Phuket, Thailand, in April 2021. The 67 full papers accepted for publication in these proceedings were carefully reviewed and selected from 291 submissions. The papers of the first volume are organized in the following topical ...

Added: January 14, 2021

Мультиязыковое моделирование с использованием DSM платформы MetaLanguage

Sukhov A., Lyadova L. N., Zamyatina E., Информатизация и связь 2013 № 5 С. 11–14

The tools of the DSM platform MetaLanguage for creating domain-specific languages and for multilevel modeling are described. The transformation definition facility reduces the effort required for language development and for model transformations. ...

Added: November 17, 2013

Большие данные в биоинформатике

Назипова Н. Н., Isaev E., Kornilov V. et al., Математическая биология и биоинформатика 2017 Т. 12 № 1 С. 102–119

Sequencing of the human genome began in 1994. It took 10 years of work by many research teams to obtain a draft sequence of human DNA. Modern sequencing technologies make it possible to obtain the genome of an individual person in a few days. The paper discusses the successes of modern bioinformatics associated with the advent of high-throughput sequencing platforms, which not only helped expand the capabilities of various areas of biology and other related ...

Added: March 3, 2017

Передача, хранение и обработка больших объемов научных данных

Grigorev A., Isaev E., Тарасов П. А., М.: ИНФРА-М, 2020.

This tutorial discusses large scientific projects and the volumes of data generated by them, provides an overview of scientific computer networks that allow high-speed transmission of large amounts of data for these projects; computing systems offered by leading manufacturers of computer equipment for processing large amounts of data, and providing both the ability to ...

Added: November 10, 2019

Анализ производительности протокола CMIS на примере его реализации в Alfresco и IBM FileNet

Пелепелин И. Е., Ерофеев Е. В., Программная инженерия 2012 № 7 С. 42–47

This article discusses the results of comparative performance testing of basic document operations in a repository (IBM FileNet, Alfresco), comparing the CMIS implementation with the native API. Some tests show a large reduction in performance when CMIS is used. Practical approaches to defining the limits of CMIS use are determined. ...

Added: March 1, 2013

Технологии и инфраструктура Big Data

Radchenko I., Николаев И. Н., СПб.: Университет ИТМО, 2018.

The textbook concisely presents the main principles, approaches and directions of Big Data technologies and infrastructure. The authors give a brief overview of approaches and definitions, survey the Big Data ecosystem and cover the topic of Big Data management systems. The textbook also presents a brief overview of Big Data application areas and the architecture of a Big Data processing system. Separately, it describes ...

Added: September 29, 2018

BitFun: Fast Answers to Queries with Tunable Functions in Geospatial Array DBMS

Rodriges Zalipynis R. A., PROCEEDINGS OF THE VLDB ENDOWMENT 2020 Vol. 13 No. 12 P. 2909–2912

Geospatial array DBMSs handle big georeferenced arrays. Due to the geospatial data peculiarities, many queries have tunable parameters with values not known in advance: users gradually tune them until they get a satisfactory result. This generates a series of queries with slightly different structures and very similar outputs. Modern array DBMSs spend the same efforts ...

Added: February 22, 2021

Информатизация стратегического менеджмента в системе архитектуры предприятия

Isaev D., Проблемы теории и практики управления 2014 № 1 С. 64–70

Questions of info-logical modeling of integrated systems for information support of strategic management are considered. The positioning of the applied modeling methods relative to the enterprise architecture methodology, which is aimed at describing the structure of an organization, is presented. ...

Added: February 3, 2014

Кластерный анализ кардиологических данных

Зимина Е. Ю., Статистика и Экономика 2018 Т. 15 № 2 С. 30–37

The article examines cluster analysis of medical data, using cardiac data as an example. Clustering methods are among the most effective and commonly used Data Mining methods applied to large amounts of information (for example, in mathematical economics): the search for signs of similarity between objects in the study of the subject area ...

Added: May 29, 2018

Design Patterns for a Knowledge-Driven Analytical Platform

Zayakin V.S., Lyadova L.N., Rabchevskiy E. A., Proceedings of the Institute for System Programming of the RAS 2022 Vol. 34 No. 2 P. 43–56

The development and support of knowledge-based systems for experts in the field of social network analysis (SNA) is complicated because of the problems of viability maintenance that inevitably emerge in data intensive domains. Largely this is the case due to the properties of semi-structured objects and processes that are analyzed by data specialists using ...

Added: July 23, 2022

Математическое обеспечение программных реализаций алгоритмов кинематики манипулятора для моделей покраски поверхности тел

Vnukov A., Шабном М., Вестник Российского университета дружбы народов. Серия: Инженерные исследования 2014 № 3 С. 38–46

The article discusses the mathematical model of the forward and inverse kinematics problems and of positioning the robot's gripper. A software implementation of the task made it possible to study the convergence and accuracy of the solution of the inverse problem, selecting the initial values and randing of angles on each iteration, and to obtain graphs of the dependence of accuracy of ...

Added: July 26, 2014

Онтологический подход к интеграции информации в областях с интенсивным использованием данных

Заякин В. С., Lyadova L. N., Рабчевский Е. А., Информационные технологии 2022 Т. 28 № 10 С. 529–538

The development and support of knowledge-based systems for experts in the field of social network analysis (SNA) is complicated because of the problems of viability maintenance that inevitably emerge in data intensive domains. Largely this is the case due to the properties of semi-structured objects and processes that are analyzed by data specialists using data ...

Added: October 22, 2022

Моделирование образовательных процессов и их оптимизация на примере модели работы с электронными образовательными ресурсами

Прокофьев Д. О., Starykh V., Информационные технологии 2015

This study investigates the main problems of automation and optimization of educational processes with the help of BPMS and Big Data. Questions concerning process modeling are raised, particularly those related to the integration of process-oriented and business analysis systems. The main goal of the study is to find a possible new way to implement the ideas of metadata ...

Added: October 9, 2015

Оценка производительности Openflow-контроллеров, реализованных на различных серверных платформах

Ю.Л. Леохин, Кузьминков В. В., Качество. Инновации. Образование 2015 Т. 127 № 12 С. 68–78

Test results for the Floodlight, RYU and POX Openflow-controllers implemented on different server platforms are presented in the article. There were three server platforms on which the OpenFlow-controllers' performance was evaluated. ...

Added: February 26, 2016

PROSPECTS OF TRANSFERRING THE LARGE VOLUMES OF RADIO ASTRONOMY DATA

Isaev E., Tarasov P. A., Odessa Astronomical Publications 2014 Vol. 27 No. 2 P. 72–73

Added: November 24, 2014

Архитектура сетевого управляющего комплекса здания на базе IoT устройств

Vikentyeva O., Kychkin A., Deryabin A. I. et al., Датчики и системы 2018 № 5 С. 32–38

This work considers the problem of designing the architecture of a network management system for a generic module of a modern automated building. To improve the efficiency of building operation given the large influx of data, the architecture of the network management system implements multicontour management of generic modules using cloud scenarios. Building operation ...

Added: July 19, 2018

Большие данные и их приложения в электроэнергетике: от бизнес аналитики до виртуальных электростанций

Krylov V., Крылов С. В., М.: Нобель Пресс, 2014.

Intended for students and specialists in information systems development, including for the electric power industry, for heads of enterprise IT departments, and for everyone who works on planning the development of the electric power industry or is simply interested in progress in this area. The book examines the area of data processing known as Big Data and describes its techniques and technologies. The main focus ...

Added: October 10, 2015

Комбинированный алгоритм выделения сообществ в графах взаимодействующих объектов

Chepovskiy A., Лобанова С. Ю., Бизнес-информатика 2017 Т. 42 № 4 С. 64–73

In this paper, we propose and implement a method for detecting intersecting and nested communities in graphs of interacting objects of different natures. For this, two classical algorithms are taken: a hierarchical agglomerative one and one based on the search for k-cliques. The combined algorithm presented is based on their sequential application. In addition, parametric options ...

Added: December 10, 2017

Хранение и обработка графа социальных сетей

Polyakov I. V., Chepovskiy A., Chepovskiy A., Вестник Новосибирского государственного университета. Серия: Информационные технологии 2013 Т. 11 № 4 С. 77–83

In this paper, a special data structure for storing and operating on a big social graph is presented. We mainly discuss searching for graph paths, obtaining subgraphs, and adding new edges and vertices. ...

Added: October 17, 2013

Efficient Exact Algorithm for Count Distinct Problem

Golov N., Filatov A., Bruskin S., in: 21st International Conference on Computer Algebra in Scientific Computing (CASC-2019). Springer, 2019. Ch. 11661 P. 67–77.

This paper describes and analyses optimization approaches that make possible the exact calculation of millions of hierarchical count distinct measures over hundreds of billions of data rows. The described approach evolved over several years, in parallel with the growth of tasks from a fast growing internet company, and was finally implemented as a PEAPM (Pipelined Exact Accumulation ...

Added: July 1, 2019