Research Highlight

CCSD Researchers Create Two Open-source Datasets for Quantum Chemical Prediction of UV/Vis Absorption Spectra for Organic Molecules

Brief: We released two open-source datasets named GDB-9-Ex [1] and ORNL_AISD-Ex [2] that provide calculations of electronic excitation energies and their associated oscillator strengths based on the time-dependent density-functional tight-binding (TD-DFTB) method. The GDB-9-Ex [1] dataset contains over 96 thousand molecules, and the ORNL_AISD-Ex [2] dataset consists of over 10 million molecules.

[GDB-9-Ex (https://www.osti.gov/biblio/1890227) and ORNL_AISD-Ex (https://www.osti.gov/biblio/1907919)]

Accomplishment: Within the Surrogates and Design products of the Artificial Intelligence for Science and Discovery (AISD) Thrust of the ORNL AI Initiative, we calculated electronic excitation energies and their associated oscillator strengths with the time-dependent density-functional tight-binding (TD-DFTB) method for two classes of organic molecules: those in the GDB-9 dataset [3] and those in the AISD-HOMO-LUMO dataset [4].

The computed excitation energies and oscillator strengths can be used to predict UV/Vis absorption spectra: the excitation energies give the positions of the absorption peaks, and the oscillator strengths measure the probability that visible or UV light is absorbed in a transition between the electronic ground state and an excited state.
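As a rough illustration of this interpretation, the sketch below turns a list of excitation energies and oscillator strengths into a continuous absorption curve by centering a Gaussian on each excitation energy and weighting it by the oscillator strength. The function names, the broadening width `sigma_ev`, and the sample values are illustrative assumptions; they are not taken from GDB-9-Ex or ORNL_AISD-Ex, and the source does not state which broadening, if any, the authors apply.

```python
import math

HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def ev_to_nm(energy_ev):
    """Convert a photon energy in eV to a wavelength in nm."""
    return HC_EV_NM / energy_ev


def broadened_spectrum(energies_ev, strengths, grid_ev, sigma_ev=0.2):
    """Sum one Gaussian per excitation, weighted by its oscillator strength.

    Returns intensities in arbitrary units, one per point of grid_ev.
    sigma_ev is an assumed broadening width, not a dataset parameter.
    """
    spectrum = []
    for e in grid_ev:
        total = 0.0
        for e0, f in zip(energies_ev, strengths):
            total += f * math.exp(-((e - e0) ** 2) / (2.0 * sigma_ev ** 2))
        spectrum.append(total)
    return spectrum


if __name__ == "__main__":
    # Made-up excitations for illustration only (eV, dimensionless strength).
    energies = [4.0, 5.5]
    strengths = [0.1, 0.6]
    grid = [3.0 + 0.01 * i for i in range(401)]  # 3.0 eV to 7.0 eV
    curve = broadened_spectrum(energies, strengths, grid)
    peak_ev = grid[curve.index(max(curve))]
    print(f"strongest absorption near {peak_ev:.2f} eV ({ev_to_nm(peak_ev):.0f} nm)")
```

The curve is in arbitrary units; only the relative peak heights and positions carry meaning in this sketch.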

Calculating the UV/Vis spectrum of a molecule requires three main steps:

  1. Convert the SMILES string representation [5] of the molecule into a geometric structure in which each atom is assigned XYZ coordinates, obtained from a preliminary geometry optimization with the Merck Molecular Force Field (MMFF94).
  2. Optimize the preliminary geometry to obtain the relaxed geometry of the molecule, that is, the equilibrium positions of the atoms in the electronic ground state.
  3. Calculate the UV/Vis spectrum of the molecule from its optimized geometry.
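The first step can be sketched as follows. This is a minimal example that assumes the open-source RDKit toolkit; the source does not say which software the authors used for the SMILES conversion or the MMFF94 pre-optimization, and the function name `smiles_to_xyz` and its parameters are hypothetical. Steps 2 and 3 require a DFTB/TD-DFTB code and are not shown.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import AllChem
except ImportError:  # RDKit is an optional third-party dependency
    Chem = None
    AllChem = None


def smiles_to_xyz(smiles, seed=42, max_iters=500):
    """Step 1 only: SMILES string -> MMFF94-relaxed geometry as an XYZ block."""
    if Chem is None:
        raise RuntimeError("RDKit is required for this sketch")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # explicit hydrogens are needed for a 3D geometry
    # EmbedMolecule returns -1 when no 3D conformer could be generated.
    if AllChem.EmbedMolecule(mol, randomSeed=seed) == -1:
        raise RuntimeError("3D embedding failed")
    # Returns 0 if converged, 1 if more iterations are needed,
    # and -1 if the force field could not be set up.
    status = AllChem.MMFFOptimizeMolecule(
        mol, mmffVariant="MMFF94", maxIters=max_iters
    )
    if status == -1:
        raise RuntimeError("MMFF94 parameters unavailable for this molecule")
    return Chem.MolToXYZBlock(mol)


if __name__ == "__main__":
    print(smiles_to_xyz("CCO"))  # ethanol
```

The resulting XYZ block is only a starting point: in the workflow described above it would be handed to the ground-state geometry optimization of step 2.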

GDB-9-Ex [1] provides excitation energies and associated oscillator strengths for 96,766 organic molecules. The molecules differ with respect to their chemical compositions (which span 4 non-hydrogen elements: oxygen, carbon, nitrogen, fluorine) and molecular size (the smallest molecule contains 5 non-hydrogen atoms, and the largest molecule contains 9 non-hydrogen atoms).

ORNL_AISD-Ex [2] provides excitation energies and associated oscillator strengths for 10,502,904 organic molecules. The molecules differ with respect to their chemical compositions (which span 5 non-hydrogen elements: oxygen, carbon, nitrogen, fluorine, sulfur) and molecular size (the smallest molecule contains 5 non-hydrogen atoms, and the largest molecule contains 71 non-hydrogen atoms).

These datasets are essential for the creation of an AI-driven workflow that performs molecular design within the AISD Thrust of the ORNL AI Initiative. Specifically, by training complex, stable, and accurate graph neural networks using the scalable HydraGNN architecture [6] on the OLCF supercomputers Summit and Frontier, we will construct fast and accurate surrogate models that enable rapid and thorough screening of large regions of chemical space. This effort will result in an AI-accelerated identification of new synthesizable molecules that attain desired functional properties.

Figure: Scatter plots showing the strong correlation between the HOMO-LUMO gap and the minimum absorption energy for organic molecules of the GDB-9-Ex dataset (left) and the ORNL_AISD-Ex dataset (right).

Acknowledgement: This research is funded by the AI Initiative, as part of the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy (DOE). 

Contact: Massimiliano Lupo Pasini (lupopasinim@ornl.gov)

Team: Massimiliano Lupo Pasini, Kshitij Mehta, Pilsun Yoo, Stephan Irle

References:

[1] M. Lupo Pasini, P. Yoo, K. Mehta, and S. Irle. GDB-9-Ex: Quantum chemical prediction of UV/Vis absorption spectra for GDB-9 molecules. United States: N. p., 2022. https://doi.org/10.13139/OLCF/1890227.

[2] M. Lupo Pasini, K. Mehta, P. Yoo, and S. Irle. ORNL_AISD-Ex: Quantum chemical prediction of UV/Vis absorption spectra for over 10 million organic molecules. United States: N. p., 2023. https://doi.org/10.13139/OLCF/1907919.

[3] R. Ramakrishnan, P. Dral, M. Rupp, et al. Quantum chemistry structures and properties of 134 kilo molecules. Sci Data 1, 140022 (2014). https://doi.org/10.1038/sdata.2014.22.  

[4] A. Blanchard, J. Gounley, D. Bhowmik, P. Yoo, and S. Irle. AISD HOMO-LUMO. United States: N. p., 2022. https://doi.org/10.13139/ORNLNCCS/1869409.

[5] D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 28 (1): 31–36 (1988). https://doi.org/10.1021/ci00057a005.

[6] M. Lupo Pasini, S. T. Reeve, P. Zhang, and J. Y. Choi. HydraGNN. Computer software. Vers. 1.0. USDOE. 19 Oct. 2021. https://www.osti.gov/servlets/purl/1826660. https://doi.org/10.11578/dc.20211019.2.