Project Details
Training AI to read the language of biology
Bridging Genes to Traits with a Unified AI Framework
Photosynthesis depends on thousands of genes, proteins, and environmentally responsive processes working together. This complexity makes it difficult for traditional models to predict how genetic changes affect plant performance. GPTgp addresses this challenge by using artificial intelligence to learn from multiple types of biological data at once.
The model integrates DNA sequences, gene expression, protein structures, and plant trait measurements such as gas exchange and hyperspectral imaging. Using a transformer-based architecture, GPTgp brings these data into a shared representation space, allowing researchers to uncover predictive links between genes and photosynthetic traits. This capability is intended to help scientists prioritize which genetic changes to test before conducting lengthy field trials.
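To make the idea of a shared representation space concrete, here is a minimal sketch, not GPTgp's actual implementation. It assumes each data type arrives as a feature vector of a different length and that a separate linear projection per data type maps it to a common dimension. The names `SHARED_DIM`, `make_projection`, and `project`, the feature values, and the random (untrained) weights are all hypothetical and chosen only for illustration.

```python
import random

SHARED_DIM = 4  # hypothetical size of the shared representation space


def make_projection(in_dim, out_dim, rng):
    """Build an out_dim x in_dim weight matrix (random here; learned in a real model)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, features):
    """Apply a linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Made-up feature vectors; each data type has its own native dimensionality.
samples = {
    "dna": [0.2, 0.7, 0.1],
    "expression": [1.5, 0.3],
    "protein_structure": [0.9, 0.4, 0.6, 0.2, 0.8],
    "gas_exchange": [28.0],
}

# One projection per data type, all targeting the same shared dimension.
projections = {
    name: make_projection(len(feats), SHARED_DIM, rng)
    for name, feats in samples.items()
}

# After projection, every data type is a vector of length SHARED_DIM, so the
# vectors can be stacked into one sequence for a transformer to attend over.
shared_tokens = [project(projections[name], feats) for name, feats in samples.items()]

for name, token in zip(samples, shared_tokens):
    print(name, [round(v, 3) for v in token])
```

The point of the sketch is only that inputs of different shapes end up as same-sized vectors that a single model can compare and combine.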
Unlocking Nature's Untapped Potential
Even after decades of crop improvement, today’s most productive crops photosynthesize at only about half the rate that some plants achieve in nature. The most elite cultivars can reach photosynthesis rates of around 30 μmol CO₂ per square meter per second, but desert-adapted species like Amaranthus palmeri can exceed 70 μmol CO₂ per square meter per second under high light, warm temperatures, and ample water.
This gap represents an enormous opportunity to improve crop productivity. However, photosynthesis operates across many levels, from molecules to leaves to whole plant canopies, and includes complex feedback loops and trade-offs, so the effect of even a simple change is difficult to predict. For example, a genetic change that boosts yield in one species may reduce growth in another. Predicting these outcomes requires more advanced, integrative tools.
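As a quick check on the size of this gap, the two rates quoted above can be compared directly. The variable names are illustrative, and the figures are the approximate values given in the text rather than measured data.

```python
# Approximate rates from the text, in μmol CO2 per square meter per second.
ELITE_CROP_RATE = 30.0
WILD_SPECIES_RATE = 70.0  # the text says such species can exceed this value

ratio = ELITE_CROP_RATE / WILD_SPECIES_RATE          # crop rate as a fraction of the wild rate
gap = WILD_SPECIES_RATE - ELITE_CROP_RATE            # absolute difference
headroom = WILD_SPECIES_RATE / ELITE_CROP_RATE - 1   # relative increase needed to close the gap

print(f"Crop rate is {ratio:.0%} of the wild rate")
print(f"Gap: {gap:.0f} μmol CO2 per square meter per second")
print(f"Closing the gap would mean a {headroom:.0%} increase")
```

With these round numbers, elite crops reach a little under half of the wild rate, and matching it would require more than doubling current photosynthesis rates.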
Photosynthesis as Language
GPTgp draws on a fundamental principle in biology: photosynthesis is shaped by both genes and the environment. Because this information is carried in sequences and measurements whose meaning depends on context, concepts from large language models can be applied to biological systems.
In GPTgp, the elements of DNA sequences act like words in a sentence, preserving meaningful patterns such as codon usage and regulatory motifs. Similarly, protein structures, gene expression, scientific images, environmental conditions, and physical measurements all become part of a shared biological vocabulary. By learning this “language of photosynthesis,” GPTgp has the potential to become the first AI model capable of “reading” photosynthetic biology.
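The analogy between DNA elements and words can be illustrated with a small tokenization sketch. It assumes non-overlapping three-base codons as the tokens, which is one common choice and not necessarily the scheme GPTgp uses. The function names and the example sequence are made up for illustration.

```python
from collections import Counter


def codon_tokens(seq):
    """Split a coding DNA sequence into non-overlapping 3-base tokens (codons)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def build_vocab(tokens):
    """Assign an integer id to each distinct token, most frequent first."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


example = "ATGGCTGCTAAAGCTTAA"  # made-up sequence, not real gene data
tokens = codon_tokens(example)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['ATG', 'GCT', 'GCT', 'AAA', 'GCT', 'TAA']
print(ids)     # [2, 0, 0, 1, 0, 3]
```

Keeping codons intact as tokens is what lets patterns such as codon usage remain visible to the model, in the same way that keeping words intact preserves meaning in text.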
Using AI to Predict Plant Performance
GPTgp is designed to accelerate the design-build-test cycle in plants. Researchers can screen candidate genes, alleles, and engineering strategies computationally before committing to costly multi-year field trials, then prioritize the variants most likely to improve real-world photosynthetic performance across different species and environments.
The model also supports learning across pathways and taxa, so that insights from well-studied model organisms can inform engineering in bioenergy crops. By generating predictions for new genotypes and conditions, GPTgp can serve as an intelligent assistant for the next generation of plant scientists and breeders working toward more productive and resilient food and energy systems.
Accelerating Discovery Science
The GPTgp project is part of the Genesis Mission—DOE’s bold new endeavor to build the world’s most powerful scientific platform to accelerate discovery science, strengthen national security, and drive energy innovation. GPTgp is supported by the DOE Biological and Environmental Research program.