Implementing Big Data Processing Workflows Using Open Source Technologies
In our implementation research, we apply a workflow approach to the modeling and development of a Big Data processing pipeline using open source technologies. The data processing workflow is a set of interrelated steps, each of which launches a particular job such as a Spark job, a shell job, or a PostgreSQL command. All workflow steps are chained to form an integrated process that imitates the data load from the staging storage area to the datamart storage area. We performed an experimental workflow-based implementation of a data processing pipeline that moves data through the different storage areas and uses an actual industrial KPI dataset of about 30 million records. Evaluation of the implementation results provides evidence that the proposed workflow is applicable to other application domains and datasets, provided they satisfy the data format expected at the input stage of the workflow.
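The chained-step structure described above can be sketched in pure Python as a minimal illustration; the step names (stage_load, spark_transform, load_datamart) and the data shapes are assumptions for illustration only, not the paper's actual jobs, which in practice would be submitted to a workflow engine and to Spark and PostgreSQL:

```python
# Minimal sketch of a linear data-processing workflow: each step is a
# callable, and the steps are chained so one step's output feeds the next,
# imitating the load from the staging area to the datamart area.

def stage_load(records):
    """Imitate loading raw records into the staging area (drop bad rows)."""
    return [r for r in records if r is not None]

def spark_transform(records):
    """Imitate a Spark job: aggregate a KPI value per key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals

def load_datamart(aggregates):
    """Imitate a SQL step: produce ordered rows for the datamart table."""
    return sorted(aggregates.items())

def run_workflow(data, steps):
    """Chain the steps into one integrated process."""
    for step in steps:
        data = step(data)
    return data

raw = [("kpi_a", 10), None, ("kpi_b", 5), ("kpi_a", 7)]
result = run_workflow(raw, [stage_load, spark_transform, load_datamart])
print(result)  # [('kpi_a', 17), ('kpi_b', 5)]
```

Each step only has to accept the format produced by the previous one, which mirrors the abstract's requirement that other datasets satisfy the data format at the workflow's input stage.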