A new approach to configurable primary data collection
- •We present a web-based user-configurable primary data collection tool.
- •To collect data from non-cooperative sources, control flow is recovered from data flow.
- •Data are transformed, sanitized and safely uploaded to the target system.
- •Data transformation/sanitization at the source reduces confidentiality issues.
- •The tool offers the option of manual entry of data absent in the source data sets.
Background and objectives
The formats, semantics and operational rules of data processing tasks in genomics (and health in general) are highly divergent and can rapidly change. In such an environment, the problem of consistent transformation and loading of heterogeneous input data to various target repositories becomes a critical success factor. The objective of the project was to design a new conceptual approach to configurable data transformation, de-identification, and submission of health and genomic data sets. Main motivation was to facilitate automated or human-driven data uploading, as well as consolidation of heterogeneous sources in large genomic or health projects.Methods
Modern methods of on-demand specialization of generic software components were applied. For specification of input–output data and required data collection activities, we propose a simple data model of flat tables as well as a domain-oriented graphical interface and portable representation of transformations in XML. Using such methods, the prototype of the Configurable Data Collection System (CDCS) was implemented in Java programming language with Swing graphical interfaces. The core logic of transformations was implemented as a library of reusable plugins.Results
The solution is implemented as a software prototype for a configurable service-oriented system for semi-automatic data collection, transformation, sanitization and safe uploading to heterogeneous data repositories—CDCS. To address the dynamic nature of data schemas and data collection processes, the CDCS prototype facilitates interactive, user-driven configuration of the data collection process and extends basic functionality with a wide range of third-party plugins. Notably, our solution also allows for the reduction of manual data entry for data originally missing in the output data sets.Conclusions
First experiments and feedback from domain experts confirm the prototype is flexible, configurable and extensible; runs well on data owner's systems; and is not dependent on vendor's standards.