Scaling Data-Driven Scientific Discovery in the Big-Data Era: A Data Lifecycle View


  • Suhas Somnath, NCCS Advanced Data and Workflow Group
December 6, 2018 - 10:00am to 11:00am


Many scientific disciplines are undergoing profound changes, driven by advanced machine learning algorithms, increased access to high-performance computing resources, and continual improvements to observational instruments that have resulted in an explosion in data volume, dimensionality, complexity, and variety. Starting with data acquisition, retaining the complete, information-rich data stream from the sensors can dramatically improve detection limits and measurement throughput. Storing the acquired data in community-developed, standardized file formats instead of proprietary file formats is essential for transparent access to data and metadata, long-term archival, seamless data exchange, and compatibility with distributed computing. The data explosion necessitates the storage of data in centralized failure-safe repositories and the use of cloud and high-performance computing resources instead of instrument-local workstations. Robust infrastructure will be necessary to connect the experimental or observational facility (EOF) to dedicated data facilities (DF), such as the Compute and Data Environment for Science (CADES), to access the data storage and analysis resources. Once datasets from EOFs begin to flow into the DF, the data can be organized and shared between researchers or across multiple laboratories using a scientific data management system, which will behave as the iTunes for scientific data. Data can then be analyzed using domain-specific and machine-learning algorithms by leveraging various computational resources at CADES or the Oak Ridge Leadership Computing Facility. Scientifically important datasets can then be peer-reviewed and published, awarded a digital object identifier, and uploaded to a centralized data catalog, similar to journal articles. These high-quality and annotated datasets can be mined to understand broader trends.
The knowledge obtained from data-driven discoveries can feed back into the conventional observation- and modeling-based modes of scientific discovery. This talk will cover efforts at each of these phases of the data lifecycle to accelerate data-driven discovery.

Additional Information 

About the Speaker:

Suhas Somnath is a computer scientist at the National Center for Computational Sciences at ORNL who straddles the physical and computational domains to find artificial intelligence, computing, and infrastructure solutions for problems in the domain sciences. Over the course of his graduate studies and postdoctoral research, he developed nanoscale metrology and manufacturing techniques using numerical modeling, microfabrication, electronics, software, and hardware. He also developed several material characterization techniques that leveraged big data and machine learning, founded a popular open-source software package called pycroscopy for analyzing large microscopy datasets, and was part of a team that connected microscopes to supercomputers for near-real-time analysis of large microscopy datasets.

Sponsoring Organization 

National Center for Computational Sciences


  • Research Office Building
  • Building: 5700
  • Room: L-204

Contact Information