
Rapid and effective response to an emerging biothreat demands high-quality, accurate, and up-to-the-minute information about disease spread. More than 90 percent of physicians in the U.S. collect electronic health data, but broad use of this vital information for public health surveillance and monitoring continually hits roadblocks due to a lack of AI-ready data and collaborative high-performance computing workflows.
The Electronic Health Records-Informed LagrangIan method for precision publiC Health (EHRLICH) project uses artificial intelligence models to automatically code unstructured clinical documents and combine these records with multimodal health and environmental data. During a biothreat scenario, these tools will help synthesize data and enable population-level simulations while preserving patient privacy and data security. These simulations will allow decision-makers to anticipate and reduce risk, improve readiness, evaluate “what-if” scenarios, and inform future investments in surveillance systems and diagnostic tools.
This project, led by top researchers at Argonne, Los Alamos, and Oak Ridge national laboratories, will apply the Department of Energy's unique capabilities to build a population-scale digital twin for public health – a set of dynamic, computational models that mirror how biological threats spread across communities. The EHRLICH team will employ the national laboratory complex’s world-renowned leadership computing facilities, powered by exascale supercomputers running at more than 1 quintillion calculations per second, to develop high-performance computing tools that will enable deep-learning capabilities for biopreparedness at scale and aid rapid decision-making to confront biothreats.
Previous accomplishments by the EHRLICH team include the development of:
- Generating Unidentifiable And Realistic Documents (GUARD): An autonomous workflow for extracting information from private datasets and generating datasets while preserving privacy.
- Compute-Aware Federated Augmented Low-rank AI Training (CAFÉ AU LAIT): A novel two-stage algorithm that allows weak clients to participate in privacy-preserving federated training.
- Framework for Uncertainty via Sequential Inference and Optimized Networks (FUSION): A framework to enable robust data assimilation by integrating sequential Monte Carlo sampling and diffusion-based inference to dynamically update model states and uncertainties in real time.
- Efficient National-scale Agent Based Learning Environment (ENABLE): A hybrid CPU-GPU framework for data-driven agent-based population health simulations.
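For readers curious about the sequential Monte Carlo sampling mentioned for FUSION, the general idea can be illustrated with a minimal bootstrap particle filter. This is only a sketch under stated assumptions, not the EHRLICH team's implementation: the function name `particle_filter`, the random-walk state model, the Gaussian observation noise, and all parameter values are hypothetical, and the diffusion-based inference component is not shown.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_sd=1.0, obs_sd=2.0, seed=0):
    """Track a hidden quantity (e.g. a disease-activity level) from noisy observations.

    Illustrative assumptions: the hidden state follows a Gaussian random walk
    and each observation equals the state plus Gaussian noise.
    Returns one (mean, standard deviation) pair per observation.
    """
    rng = random.Random(seed)
    # Start with particles scattered around the first observation.
    particles = [observations[0] + rng.gauss(0.0, obs_sd) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate: move every particle forward with the state model.
        particles = [p + rng.gauss(0.0, process_sd) for p in particles]
        # 2. Weight: score each particle by how well it explains the new observation.
        weights = [math.exp(-0.5 * ((y - p) / obs_sd) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # Fall back to uniform weights if every likelihood underflowed.
            weights = [1.0] * n_particles
            total = float(n_particles)
        weights = [w / total for w in weights]
        # 3. Summarize: weighted mean and spread give the state and its uncertainty.
        mean = sum(w * p for w, p in zip(weights, particles))
        var = sum(w * (p - mean) ** 2 for w, p in zip(weights, particles))
        estimates.append((mean, math.sqrt(var)))
        # 4. Resample: keep particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    for mean, sd in particle_filter([10.0, 11.0, 13.0, 16.0, 20.0]):
        print(f"estimate {mean:.2f} +/- {sd:.2f}")
```

Each new observation updates both the estimate and its uncertainty, which is the "dynamically update model states and uncertainties" behavior described above, shown here in its simplest form.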
The EHRLICH team is also participating in the National Artificial Intelligence Research Resource (NAIRR) secure pilot program. AI will help our scientific team to develop the necessary infrastructure for near real-time situational readiness that will prepare authorities at the national, state, and local levels to face the next major biothreat.
MOSSAIC: Ramping Up the Fight Against Cancer
Effective treatment and prevention of cancer depend on early, consistent, and accurate reporting of cases. Doctors diagnosed nearly 2 million cases of cancer in the U.S. in 2022 alone – an average of 5,250 people per day – but analysis of these cases at the national level lags by two years due to a lack of centralized data collection and to delays in reporting.
The Modeling Outcomes Using Surveillance Data and Scalable Artificial Intelligence for Cancer (MOSSAIC) project seeks to cut through these delays and speed up case reporting by using artificial intelligence models to extract meaningful information from complex medical reports and to modernize the National Cancer Institute’s Surveillance, Epidemiology, and End Results cancer registry and statistics program.
Clinical notes, medical reports, and other electronic health records tend to be long, complicated documents without formal structure. Converting these documents to structured formats that can be analyzed and integrated with other clinical data sources has traditionally been done manually – a complex and time-consuming task.
MOSSAIC’s AI tools for cancer incidence reporting enable automatic coding of these long clinical text documents for near-real-time reporting and rapid categorization of cancer cases. Recent additions to the MOSSAIC suite include AI tools for early identification of recurrent metastatic cancer, a critical resource for patients and providers. The U.S. has no current population-scale data on risk for recurrent metastatic cancer, which would require the constant review of huge amounts of clinical text by a wide-ranging set of medical professionals. The new AI models could fill this gap and improve identification of this disease nationwide.
Previous accomplishments by the MOSSAIC team include the development of:
• OncoID and OncoIE, two production-level models that have been integrated into the workflows for all central cancer registries and are used to assist with the identification of nearly half of all cancer cases in the U.S.
• OncoMetsID, the first machine learning model for the classification of metastatic cancer.
• Path-BigBird, the first oncology foundation AI model trained from scratch on cancer pathology data, with the capacity to learn structural information from unlabeled text. Foundation models are trained on broad sets of data and then adapted to a wide range of tasks, such as identifying types of cancer.
• BARDI, a comprehensive text preparation and AI readiness framework designed to ensure clinical text can be processed properly for various machine learning applications.
• CONNECT, a population-scale resource for linking environmental exposures to cancer outcomes. Residential history data spanning 1990 to the present are being linked to air pollution exposure data, toxic release data, and indoor home radon exposure data to determine the likelihood of environmental exposure for cancer patients.
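The kind of linkage described for CONNECT can be pictured with a small sketch that joins a residential history to a yearly exposure table and averages the result. It is an illustration under stated assumptions, not the CONNECT pipeline: the `Residence` record, the `mean_exposure` function, the region labels, and the exposure values are all hypothetical placeholders rather than real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residence:
    """One entry in a person's residential history (hypothetical schema)."""
    region: str
    start_year: int  # inclusive
    end_year: int    # inclusive


def mean_exposure(history, exposure_by_region_year):
    """Average an exposure measure over every year a person lived somewhere.

    `exposure_by_region_year` maps (region, year) to a measured value.
    Years with no measurement are skipped rather than guessed.
    Returns None when no year could be matched.
    """
    total = 0.0
    matched_years = 0
    for residence in history:
        for year in range(residence.start_year, residence.end_year + 1):
            value = exposure_by_region_year.get((residence.region, year))
            if value is None:
                continue
            total += value
            matched_years += 1
    return total / matched_years if matched_years else None


if __name__ == "__main__":
    # Placeholder values for illustration only.
    history = [Residence("A", 1990, 1991), Residence("B", 1992, 1992)]
    exposure = {("A", 1990): 10.0, ("A", 1991): 12.0, ("B", 1992): 8.0}
    print(mean_exposure(history, exposure))
```

Skipping unmatched years instead of filling them in keeps the average tied to measured data, at the cost of covering less of the residential history.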
MOSSAIC will help our scientific team to develop the much-needed tools for rapid detection and analysis of cancer cases and outcomes, which will help uncover new insights into cancer treatment and prevention.
C-HER: Centralized Health and Exposomic Resource
The Centralized Health and Exposomic Resource (C-HER) lies at the core of the exposomic research led by the MOSSAIC and EHRLICH teams.
C-HER brings together diverse environmental and place-based exposure data with population health information to support studies of cancer and other health outcomes. C-HER includes modern computational tools, curated datasets, and detailed documentation designed to make exposomic research more accessible and actionable. As an AI-ready data platform, C-HER serves as the foundation for advanced analytic tools used by the MOSSAIC and EHRLICH teams, enabling faster, more integrated research into how environmental factors influence human health.
