We focus on two areas of statistics that are important to scalable data science: (1) the development of scalable analytics and (2) the understanding of high-dimensional response functions.

(1) The development of scalable analytics lags far behind the simulation sciences in its use of high-performance computing resources. Those who develop statistical analytics prefer high-level programming languages that are close to the mathematics: they know what can run asynchronously in the mathematics, but not the additional intricacies of developing and running codes on large computational platforms. As a result, there are large, diverse collections of serial analytical tools but only a scant number of scalable analytics. We are developing high-level methods for programming with big data (pbd) to engage this community and enable it to prototype new scalable codes.

(2) Computational science codes often have large numbers of parameters that influence their output, and it is often difficult to determine which parameters and parameter interactions are important over an input region of interest. Statistical techniques exist for attributing output variability to parameters, but they rely on designed sample spaces, which are typically unavailable in simulation science collections either because the parameter space is too large or because statistical design was not used in selecting parameter combinations. We combine surrogate models with analysis of variance to provide variability attribution and parameter-effect estimation techniques.
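The programming model behind (1) can be sketched as follows. This is a minimal, hypothetical Python stand-in, not the pbd tools themselves (which target R on MPI-based HPC platforms): the analyst writes a map/reduce over data blocks that stays close to the mathematics, while the parallel machinery is hidden behind a pool of workers. The function names `block_stats` and `parallel_mean` are illustrative, not part of any pbd interface.

```python
# Sketch of the high-level style the pbd effort aims for: per-block
# sufficient statistics (the "map") plus a one-line reduction, with
# process management hidden from the analyst.
from concurrent.futures import ProcessPoolExecutor

def block_stats(block):
    # per-block sufficient statistics for a mean: (count, sum)
    return (len(block), sum(block))

def parallel_mean(blocks, workers=3):
    # the analyst sees only map + reduce; worker setup stays hidden
    with ProcessPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(block_stats, blocks))
    n = sum(count for count, _ in stats)
    total = sum(s for _, s in stats)
    return total / n

if __name__ == "__main__":
    # data already partitioned into blocks, as on a distributed platform
    blocks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
    print(parallel_mean(blocks))   # mean of 1..5
```

The same pattern extends to any statistic with decomposable sufficient statistics (variances, cross-products, gradient sums), which is what makes a high-level map/reduce interface expressive enough for much of statistical computing.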
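The variability-attribution idea in (2) can be illustrated with a small functional-ANOVA computation. This is a sketch under stated assumptions: the quadratic `f` below is a hypothetical stand-in for a surrogate already fitted to simulation output, the parameter grid is illustrative, and only main effects are shown (the remainder of the variance is due to interactions).

```python
# Variance-based attribution (functional ANOVA) over a full-factorial
# grid: the main effect of parameter i is the variance of the
# conditional mean E[f | x_i], reported as a share of total variance.
from itertools import product
from statistics import fmean, pvariance

def f(a, b, c):
    # toy surrogate: strong effect of a, moderate b, inert c,
    # plus a small a:b interaction
    return 2.0 * a + b ** 2 + 0.1 * a * b

levels = [-1.0, -0.5, 0.0, 0.5, 1.0]     # common grid per parameter
grid = list(product(levels, repeat=3))   # full factorial design
outputs = [f(*x) for x in grid]
total_var = pvariance(outputs)

def main_effect(i):
    """Variance of the conditional mean E[f | x_i] over its levels."""
    cond_means = [
        fmean(y for x, y in zip(grid, outputs) if x[i] == v)
        for v in levels
    ]
    return pvariance(cond_means)

for name, i in [("a", 0), ("b", 1), ("c", 2)]:
    print(f"parameter {name}: {main_effect(i) / total_var:.1%} of variance")
```

In practice the surrogate replaces the expensive simulation so that a designed grid like this can be evaluated cheaply, even when the original parameter combinations were not chosen by statistical design.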
Presentation: Applied Statistics for the Office of Science