Research Highlight

Adaptive Language Model Training for Molecular Design

Brief: Oak Ridge National Laboratory (ORNL) researchers have developed a new molecule-generation strategy to accelerate molecular design for drug discovery applications.

Accomplishment: Building upon recent advances in applying natural language processing techniques to chemical sequences [1,2], researchers developed a new strategy to accelerate molecular design for drug discovery applications. Masked language models for molecular data (i.e., molecular transformers) can automate the generation of new chemical structures by learning commonly occurring subsequences and rearrangements from large compound libraries. By adding a scoring model, the language model can generate sequences optimized for a given task, such as predicted binding affinity for a protein target. To boost optimization performance, the generative model can be further adapted by training on intermediate populations of optimized molecules. This adaptive strategy, in contrast to a fixed generative model, enables molecular transformers to be used for a range of design tasks.
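
The mutation step can be illustrated with a short Python sketch using a masked language model from Hugging Face. This is a hedged outline under stated assumptions, not the project's code: the checkpoint name, the mutate helper, and the single-token masking choice are all illustrative stand-ins for whatever SMILES-pretrained masked language model is used.

```python
# Minimal sketch of masked-LM mutation for SMILES strings. The checkpoint name
# is a publicly available SMILES-pretrained model used only as a stand-in; it
# is not necessarily the model trained in this work.
import random

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="seyonec/ChemBERTa-zinc-base-v1")

def mutate(smiles: str, top_k: int = 5) -> list[str]:
    """Mask one randomly chosen token and let the language model propose fills."""
    tokens = fill_mask.tokenizer.tokenize(smiles)
    tokens[random.randrange(len(tokens))] = fill_mask.tokenizer.mask_token
    masked = fill_mask.tokenizer.convert_tokens_to_string(tokens)
    # Each prediction is the full candidate sequence with the mask filled in.
    return [p["sequence"] for p in fill_mask(masked, top_k=top_k)]

# Example: propose single-token mutations of aspirin.
print(mutate("CC(=O)Oc1ccccc1C(=O)O"))
```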

Figure: Strategy for molecule optimization using a language model (i.e., molecular transformer). An initial population of molecules is used as input. The language model then generates mutations using predictions for randomly placed masks. Molecules are ranked according to a specified score, and top performers are selected for another round of mutations. Two approaches for the language model are investigated: fixed and adaptive. In the fixed approach, the language model is pre-trained on a large molecule dataset and does not change during the optimization process. In the adaptive approach, the language model is trained on the selected population, which itself changes during the optimization process. (Image credit: CCSD, ORNL AI Initiative)
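
Continuing the sketch above, the loop in the figure might look as follows. Again, everything here is an assumption for illustration: score stands in for a task-specific scoring model (e.g., predicted binding affinity), finetune_on performs one masked-language-model training step on the selected population, and adaptive=False recovers the fixed approach.

```python
# Hedged outline of the fixed vs. adaptive loop; reuses `fill_mask` and `mutate`
# from the sketch above. Hyperparameters (learning rate, population size,
# masking rate) are illustrative guesses, not values from the paper.
import torch

optimizer = torch.optim.AdamW(fill_mask.model.parameters(), lr=1e-5)

def finetune_on(smiles_batch: list[str]) -> None:
    """One masked-LM training step on the selected population (adaptive approach)."""
    tok = fill_mask.tokenizer
    fill_mask.model.train()  # the pipeline loads the model in eval mode
    enc = tok(smiles_batch, return_tensors="pt", padding=True)
    labels = enc["input_ids"].clone()
    # Standard MLM objective: mask ~15% of non-padding tokens and compute the
    # loss only at the masked positions. (This simplified sketch may
    # occasionally mask special tokens as well.)
    mask = (torch.rand(labels.shape) < 0.15) & enc["attention_mask"].bool()
    enc["input_ids"][mask] = tok.mask_token_id
    labels[~mask] = -100
    loss = fill_mask.model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    fill_mask.model.eval()

def optimize(population: list[str], score, n_rounds: int = 10,
             keep: int = 64, adaptive: bool = True) -> list[str]:
    for _ in range(n_rounds):
        # Mutate: the language model fills randomly placed masks.
        candidates = set(population)
        for smiles in population:
            candidates.update(mutate(smiles))
        # Rank by the task score and keep the top performers.
        population = sorted(candidates, key=score, reverse=True)[:keep]
        # Adaptive approach only: retrain the language model on the survivors,
        # biasing later mutations toward high-scoring chemistry. With
        # adaptive=False the pre-trained model stays fixed throughout.
        if adaptive:
            finetune_on(population)
    return population
```

The retraining step inside the loop is what distinguishes the adaptive strategy: the generator's proposal distribution tracks the evolving population of top performers rather than staying anchored to the original pre-training corpus.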

Acknowledgement: This research was funded by the AI Initiative, as part of the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy (DOE), and by the Exascale Computing Project (ECP) (17-SC-20-SC), a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Publications and presentations resulting from this work: Blanchard, A.E., Bhowmik, D., Gounley, J., Glaser, J., Akpa, B., Irle, S. Adaptive Language Model Training for Molecular Design. In preparation.

Contact: Andrew Blanchard (blanchardae@ornl.gov)

Team: Andrew Blanchard, Debsindhu Bhowmik, John Gounley, Jens Glaser, Belinda Akpa, Stephan Irle

References:  

  1. Blanchard, A.E., Chandra Shekar, M., Gao, S., Gounley, J., Lyngaas, I., Glaser, J., Bhowmik, D. Automating Genetic Algorithm Mutations for Molecules Using a Masked Language Model. IEEE Transactions on Evolutionary Computation. 2022.
  2. Blanchard, A.E., Gounley, J., Bhowmik, D., Chandra Shekar, M., Lyngaas, I., Gao, S., Yin, J., Tsaris, A., Wang, F., Glaser, J. Language Models for the Prediction of SARS-CoV-2 Inhibitors. Finalist, ACM Gordon Bell Special Prize for COVID-19 Research. 2021.