Large language models (LLMs) have risen rapidly to the center stage of artificial intelligence as foundation models applicable to many downstream learning tasks. However, how to effectively build, train, and serve such models for many high-stakes, first-principles-based scientific use cases is both of great interest and a great challenge. Moreover, pre-training LLMs with billions or even trillions of parameters can be prohibitively expensive, not just for academic institutions but also for well-funded industrial and government labs. Furthermore, the energy cost and environmental impact of developing LLMs must be kept in mind. In this work, we conduct a first-of-its-kind performance analysis to understand the time and energy cost of pre-training LLMs on the Department of Energy (DOE)’s leadership-class supercomputers. Employing state-of-the-art distributed training techniques, we evaluate the computational performance of various parallelization approaches at scale for a range of model sizes, and establish a projection model for the cost of full training. Our findings provide baseline results, best practices, and heuristics for pre-training such large models that should be valuable to the HPC community at large. We also offer insights and optimization strategies for using the first exascale computing system, Frontier, to train models of the size of GPT-3 and beyond.