Chapter 8 Building Resilience into the Metadata-Based ETL Process Using Open Source Big Data Technologies
Extract-transform-load (ETL) processes play a crucial role in data analysis in real-time data warehouse environments, which demand low latency and high availability. ETL processes are becoming bottlenecks in such environments because of growing complexity, the number of data transformation steps, the number of machines used for data processing and, finally, the increasing impact of human factors on the development of new ETL processes. To mitigate this impact and make the ETL process resilient, a dedicated metadata framework is needed that can manage the design of new data pipelines and processes. In this work, we focus on ETL metadata and its use in driving process execution, and present a proprietary approach to the design of metadata-based process control that can reduce complexity, enhance the resilience of ETL processes and allow their adaptive self-reorganization. We present a metadata framework implementation based on open-source Big Data technologies, describing its architecture and interconnections with external systems, data model, functions, quality metrics, and templates. A test execution of an experimental Airflow Directed Acyclic Graph (DAG) with randomly selected data is performed to evaluate the proposed framework.