Publication

Provenance-aware workflow for data quality management and improvement for large continuous scientific data streams

Publication Type
Conference Paper
Publication Date
Page Numbers
3260 to 3266

Data quality assessment, management, and improvement are an integral part of any big-data-intensive scientific research, ensuring accurate, reliable, and reproducible scientific discoveries. Maintaining data quality, however, is non-trivial and poses a challenge for a program like the Department of Energy’s Atmospheric Radiation Measurement (ARM), which collects data from hundreds of instruments across the world and distributes thousands of streaming data products that grow continuously in near real time, in an archive of 1.7 petabytes and growing. We present a computational data processing workflow that collects data quality issues via an easy, intuitive web-based portal, which allows reporting of quality issues for any site, facility, or instrument at a granularity down to an individual variable in the data files. The portal allows instrument specialists and scientists to specify corrective actions in the form of symbolic equations. A parallel processing framework applies the data improvements to large volumes of data efficiently, while optimizing data transfer and file I/O operations. Corrected files are systematically versioned and archived. A provenance tracking module records every change made to the data during its entire life cycle, and these changes are communicated transparently to scientific users. Developed in Python using open source technologies, the software architecture enables fast and efficient management and improvement of data in an operational data center environment.
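
The correction step described in the abstract (a symbolic equation applied in parallel, followed by versioning and provenance recording) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that a correction is a plain arithmetic expression in the name of the affected variable, that each data file can be represented as an in-memory dictionary, and that a thread pool stands in for the parallel processing framework. The names `evaluate`, `correct_file`, and `apply_correction`, the record layout, and the file names are hypothetical.

```python
import ast
import hashlib
import json
import operator
from concurrent.futures import ThreadPoolExecutor

# Arithmetic operators permitted in a correction equation (an assumption).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def evaluate(equation, variables):
    """Evaluate an arithmetic equation string against named variable values."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression element: " + type(node).__name__)
    return walk(ast.parse(equation, mode="eval"))


def checksum(values):
    """Short, stable digest of a list of values, used to identify a version."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()[:12]


def correct_file(record, variable, equation):
    """Return a corrected, version-bumped copy of one record and its provenance entry."""
    old_values = record["data"][variable]
    new_values = [evaluate(equation, {variable: v}) for v in old_values]
    corrected = {
        "name": record["name"],
        "version": record["version"] + 1,  # the original record is left untouched
        "data": {**record["data"], variable: new_values},
    }
    provenance = {
        "file": record["name"],
        "variable": variable,
        "equation": equation,
        "from_version": record["version"],
        "to_version": corrected["version"],
        "checksum_before": checksum(old_values),
        "checksum_after": checksum(new_values),
    }
    return corrected, provenance


def apply_correction(records, variable, equation, workers=4):
    """Apply one correction to many records concurrently; return new records and a log."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda r: correct_file(r, variable, equation), records))
    return [c for c, _ in results], [p for _, p in results]


# Hypothetical records standing in for data files; names and values are placeholders.
records = [
    {"name": "file_a", "version": 0, "data": {"temp": [10.0, 12.0]}},
    {"name": "file_b", "version": 0, "data": {"temp": [20.0]}},
]
corrected, log = apply_correction(records, "temp", "temp * 2 + 1")
```

Parsing the equation with `ast` instead of calling `eval` restricts a submitted correction to arithmetic on named variables. Because every corrected record is a new version with a provenance entry recording the equation and before/after checksums, an earlier version can still be identified and compared.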