Abstract
Deep learning models are efficient computational tools that can accelerate the inverse design of molecules with desired functional properties by generating predictions in a fraction of the time required by traditional quantum chemical approaches. To remain accurate and transferable across the broad regions of chemical space explored during inverse design, a model must be trained on very large volumes of simulation data. Collecting these data requires running large-scale ensembles of quantum chemical calculations on high-performance computing (HPC) systems. Executing such ensembles efficiently and managing the resulting output, however, requires tools that use computational resources judiciously and limit metadata overhead on the file system. We therefore present a high-performance, scalable ensemble management framework for data-intensive quantum chemical electronic structure calculations on organic molecules. The framework provides abstractions for plugging in different ab initio, first-principles, and first-principles-based semi-empirical methods and executes them efficiently at large scale on HPC systems. It dynamically distributes tasks to resources and uses tiered storage to manage large collections of files. We used the framework to process over ten million organic molecules with time-dependent density-functional tight-binding calculations and to generate open-source datasets of UV-vis absorption spectra. The result is the largest database of molecular optical spectra simulated with quantum chemical methods in a consistent manner.
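The two ideas of a pluggable method abstraction and dynamic task distribution can be illustrated with a minimal Python sketch. It is not the framework's actual interface: the names `Calculator`, `MockTDDFTB`, and `run_ensemble` are hypothetical, the "calculation" is a placeholder that returns no real spectrum, and a local thread pool stands in for HPC resources. Tiered storage is omitted.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor, as_completed


class Calculator(ABC):
    """Hypothetical plug-in interface: one subclass per electronic structure method."""

    @abstractmethod
    def run(self, molecule: str) -> dict:
        """Run one calculation for one molecule and return its results."""


class MockTDDFTB(Calculator):
    """Placeholder for a real method; it only echoes the input string."""

    def run(self, molecule: str) -> dict:
        # A real subclass would invoke a quantum chemistry code here (assumption).
        return {"molecule": molecule, "n_chars": len(molecule)}


def run_ensemble(calculator: Calculator, molecules, n_workers: int = 4) -> dict:
    """Queue one task per molecule; each worker takes the next task when it becomes free."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(calculator.run, m): m for m in molecules}
        # Collect results in completion order, not submission order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    output = run_ensemble(MockTDDFTB(), ["CCO", "c1ccccc1", "CC(=O)O"])
    for name in sorted(output):
        print(name, output[name])
```

Because tasks are pulled from a shared queue rather than assigned in fixed batches, a worker that finishes a small molecule early picks up the next one immediately, which is the load-balancing behavior that dynamic distribution is meant to provide. Swapping in a different method would only require another `Calculator` subclass.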