ORNL data scientists issue 1.0 software release, continue drawing new statistical data users to HPC with pbdR
Scientists at the Department of Energy’s Oak Ridge National Laboratory have released pbdR 1.0, a full suite of software packages they developed to make the R programming language easy to use for high-performance computing. R is the most commonly used statistical data analysis software among academic researchers.
Intended to enable big data analysis — analyzing large groups of data on leadership-class supercomputers such as ORNL’s Titan and Summit — among research disciplines that typically work with small datasets, pbdR offers ease of installation and use, computational speed and capability across multiple operating systems.
Titan is currently the fastest computer in the United States, and Summit, slated to be five to ten times faster than Titan, is due to come online this year.
pbdR uses a distributed data framework to break down large datasets into smaller information chunks analyzed by multiple processors using MPI, the message-passing standard for parallel computing. Parallel computing makes HPC possible by sending instructions over a network of independent processor nodes that communicate with each other to analyze data. This division of labor enables high-speed processing of vast amounts of data.
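The split-and-combine pattern described above can be illustrated with a minimal Python sketch. This is not pbdR code and does not use MPI: it runs in a single process, with each chunk standing in for the data held by one processor, and the function names (`split_into_chunks`, `local_summary`, `distributed_mean`) are hypothetical, chosen only for this illustration. In a real MPI program, each chunk would live on its own process and the final combination would be done by a reduction across processes.

```python
def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous pieces of near-equal size."""
    base, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for rank in range(n_chunks):
        # The first `extra` chunks each take one additional element.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def local_summary(chunk):
    """Work one processor would do on its own chunk: partial sum and count."""
    return sum(chunk), len(chunk)


def distributed_mean(data, n_workers=4):
    """Compute the mean by combining per-chunk partial results."""
    chunks = split_into_chunks(data, n_workers)
    # Simulated here with a plain loop; under MPI each call would run
    # on a separate process that holds only its own chunk.
    partials = [local_summary(chunk) for chunk in chunks]
    # The "reduce" step: combine the small partial results into one answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    values = list(range(1, 1001))
    print(distributed_mean(values))  # prints 500.5
```

Only the small partial results (a sum and a count per chunk) need to be exchanged, not the data itself, which is what lets this kind of division of labor scale to very large datasets.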
“This 1.0 release was driven by vendor requests for single-focus software stacks,” said George Ostrouchov, a senior data scientist with ORNL’s Scientific Data Group in the Computer Science and Mathematics Division. “Vendors like Cray wanted to ship their equipment with the software installed.”
Ostrouchov and Drew Schmidt, a software engineer with the Oak Ridge Leadership Computing Facility’s Advanced Data and Workflow Group, introduced the first pbdR modules in 2012. The OLCF is a DOE Office of Science User Facility at ORNL.
The developers, who share an interest in using computing to answer statistical questions, said that many traditional statistical methods focus on first reducing large datasets to small ones by sampling, which can make it challenging to introduce statisticians to leadership-class computing.
“In 2012, we felt R-using researchers were not as engaged as we would like them to be,” Ostrouchov said. “Even if a statistical method focuses on rendering big data problems smaller, researchers need to be able to experiment with large systems and demonstrate the value of data reductions.”
Their vision was to build a bridge to HPC for scientists in fields not yet involved in supercomputing by providing access to big data analysis tools that use a familiar syntax. They have worked steadily since pbdR’s initial release in 2012 to spur widespread adoption across academic disciplines, and they’ve seen encouraging growth as their tools have become more user-friendly and customizable.
“pbdR is a modular product,” Schmidt said. “You can use one module without using the others, or you could use any combination of two or more that suits your particular data analysis needs.”
Researchers across ORNL are reaching out to Schmidt and Ostrouchov for help with applying pbdR to data analysis for a broadening range of academic areas, including bioinformatics, software archeology, microscopy for materials science and climate data analysis.
In a Laboratory Directed Research and Development project, Junqi Yin, a computational scientist in the Advanced Data and Workflow Group examining materials science problems, investigated the best models for machine learning by running simulations on Titan to generate training data. Schmidt worked with Yin to adjust his data analysis approach to improve testing results. They used pbdR to manage the model fitting and selection, including automatically handling checkpoint restart. Schmidt and Yin are again collaborating to improve results.
In addition to the current full-suite release, various subsets of the component packages have been installed at U.S. and overseas university computing centers, several DOE-funded national laboratories, and the National Institutes of Health.
A review of distributed data analytics software conducted this year at the University of California, San Diego, found that pbdR outperforms all others in most dense data scenarios. It was also judged to be one of the best interface languages to improve user productivity.
Ostrouchov and Schmidt’s next steps include adapting pbdR to use GPUs, the energy-efficient coprocessors built into supercomputers like Titan and Summit. “Developing this end of pbdR’s capabilities will make it all the more useful for HPC tasks,” Ostrouchov said.
The scientists, working jointly with staff at the University of Tennessee’s National Institute for Mathematical and Biological Synthesis, are also set to embark on a LiDAR (light detection and ranging) data project for the National Science Foundation. They anticipate that connecting LiDAR data to pbdR capabilities on HPC systems will draw more interest in supercomputing from the ecology community.
ORNL is managed by UT-Battelle for the Department of Energy's Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit http://science.energy.gov/.