The Sound of Science

Exascale: The New Frontier of Computing

In May 2022, history was made at Oak Ridge National Laboratory. Frontier, the lab’s newest supercomputer, officially did what no other computer in the world had done before — it crossed the exascale barrier. If you're not familiar with the field of supercomputing, an exascale computer is an incredibly powerful system that is capable of a quintillion calculations per second. Frontier’s arrival marks a new era of computational performance that will help enable scientific breakthroughs never before possible. But this milestone didn't happen overnight. The journey to Frontier has been years in the making, with plenty of challenges and dramatic moments along the way. In this episode, you'll hear a behind-the-scenes account of what it took to launch the world’s first exascale computer. 



Transcript

[THEME MUSIC] 

EVAN SCHNEIDER: I'm absolutely thrilled, this is literally something I've been working towards for 10 years.  

JUSTIN WHITT: A major challenge in building supercomputers is that when you begin, many of the technologies that you need don't exist yet. It's a bit like building an airplane while it's in flight.

THOMAS ZACHARIA: Imagine that kind of capability. Imagine the kind of resource that we are placing in the hands of this new generation of scientists.  

 
[MUSIC TRANSITION] 

JENNY: Hello everyone and welcome to “The Sound of Science,” the podcast highlighting the voices behind the breakthroughs at Oak Ridge National Laboratory. 

MORGAN/JENNY: We’re your hosts, Morgan McCorkle and Jenny Woodbery. 

[MUSIC TRANSITION] 

JENNY: In May 2022, history was made at Oak Ridge National Laboratory. Frontier, the lab’s newest supercomputer, officially did what no other computer in the world had done before -- it crossed the exascale barrier.  

MORGAN: Now, if you’re not that familiar with the world of supercomputing, you may be doing a quick Google search to find out what the word exascale even means. But we’ll save you the trouble. 

JENNY: An exascale computer is an incredibly powerful system that is capable of a quintillion calculations per second. That’s a one with 18 zeros behind it.  

MORGAN: While powerful supercomputers exist around the world, none have achieved this level of performance until Frontier. 

JENNY: Since 2008, the fastest supercomputers in the world have been at the petascale, meaning they can perform at least a quadrillion calculations per second.

MORGAN: ORNL has been home to several petascale machines which ranked as the world’s fastest when they debuted – including the Summit supercomputer, which is capable of 200 quadrillion calculations per second.  

JENNY: While that’s a mindboggling number, as an exascale machine, Frontier’s peak performance will make it 10 times faster than Summit.  

MORGAN: It would take the entire population of Earth more than four years to calculate what Frontier can do in one second. 
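The four-year comparison can be checked with back-of-the-envelope arithmetic. The sketch below rests on assumptions the episode does not state: a world population of roughly 8 billion, one calculation per person per second with no breaks, and the 1.1-exaflop benchmark figure reported later in the episode as the work Frontier does in one second. The function and variable names are illustrative only.

```python
# Rough check of the "entire population of Earth" comparison.
# Assumptions (not from the episode): about 8 billion people, each doing
# one calculation per second around the clock.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_for_population(total_calculations, people, calcs_per_person_per_second=1.0):
    """Return the years a group of people needs to finish a number of calculations."""
    seconds = total_calculations / (people * calcs_per_person_per_second)
    return seconds / SECONDS_PER_YEAR


frontier_one_second = 1.1e18  # calculations in one second at 1.1 exaflops
world_population = 8e9        # assumed population

years = years_for_population(frontier_one_second, world_population)
print(f"{years:.1f} years")   # prints "4.4 years"
```

Under these assumptions the result lands a little above four years, which is consistent with the figure quoted above; a different population or per-person rate shifts the answer proportionally.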

JENNY: Frontier’s arrival marks a new era of unprecedented computational performance that will help enable scientific breakthroughs never before possible.  

MORGAN: Reaching a milestone like this doesn’t happen overnight. The journey to Frontier has been years in the making, with plenty of challenges and dramatic moments along the way. 

JENNY: So, settle in for a behind-the-scenes account of what it took to launch the world’s first exascale computer. 

[MUSIC TRANSITION] 

JENNY: Supercomputers are incredible machines that are essential to solving the most complex scientific problems out there.  

MORGAN: Their power and speed enable scientists to accelerate the development of life-saving drugs, model Earth systems to understand climate change, and create new materials for energy storage.  

JENNY: All tasks that would be far too complicated for the computers we use every day to handle. 

MORGAN: But unlike personal computers, a supercomputer can’t be bought off the shelf, fully assembled and ready to use. Getting one up and running is a much more complex process.  

JENNY: Thankfully ORNL has a lot of practice in deploying these powerful machines. Since 2004, the Oak Ridge Leadership Computing Facility, or OLCF, has deployed 10 supercomputers.  

MORGAN: Each generation of more powerful supercomputers brings new problems to overcome.  

WHITT: A major challenge in building supercomputers in general is that when you begin, many of the technologies that you need don't exist yet. So, you’re designing a computer, you're writing software, and building a data center around the idea of new technologies, then, you know, working to make those technologies a reality, while adjusting to what is possible, you know, as you go along. It's a part of the key challenge. It's a bit like building an airplane while it's in flight.

MORGAN: That’s Justin Whitt. As Frontier’s project director, he’s been the driving force behind deploying the lab’s latest supercomputer.  

JENNY: He and his team have been preparing for Frontier’s arrival since 2018. The project started with making sure the infrastructure needed for a machine of this scale was in place. 

WHITT: Really it’s three things, the computer technologies to actually build the computer, both hardware and software. But you also need a place to put the computer. So, we had to build a new data center to house the system, and the computer software, then that it takes for the researchers to actually make use of the computer. You know, for this, we had to bring in additional power and cooling capacity to, to the building.  

MORGAN: Frontier is housed in the same space that housed ORNL’s Titan supercomputer.  

JENNY: After years of groundbreaking science, Titan was decommissioned and recycled in 2019 to make way for Frontier. 

MORGAN: While they roughly share the same footprint – about the size of two basketball courts – Frontier is a much more heavy-duty machine.  

JENNY: Literally – each of its 74 cabinets weighs over 8,000 pounds. That’s the weight of two pickup trucks. So, to support the additional weight, thousands of load-bearing floor tiles needed to be installed.

MORGAN: A more powerful computer also comes with the need for more power. ORNL worked with the Tennessee Valley Authority to run an additional 2.5 miles of high-voltage power lines to the lab to provide the computer room with more than 40 megawatts of power. 

JENNY: And when you’re bringing that much heat in, you also need a way to cool it down. 

WHITT: The Frontier computer, it can consume up to 29 megawatts of power at any time. That's an incredible amount of electricity, it's enough to power every home in a city roughly the size of Green Bay, for instance. The upgrades involved bringing new power lines from a nearby substation down to the data center, and then installing all the equipment needed to condition and distribute that much power inside our data center. And if you have that much power going in, you know, you obviously have to get that heat back out. And these high-performance computers are liquid cooled, that's how we get a lot of the densities we get. And so we also built a new mechanical plant that's capable of circulating thousands of gallons of cooling water every minute to get that heat back out of the computer.

MORGAN: That’s 6,000 gallons a minute to be exact. That amount of water could fill up an Olympic-sized swimming pool in about 90 minutes. 

JENNY: While these infrastructure upgrades were being made, Hewlett Packard Enterprise and Advanced Micro Devices were chosen as the vendors to build Frontier. 

WHITT: It really started with finding the right technology partners for Frontier. We chose to work with Advanced Micro Devices, a company called AMD. And they produce the cutting-edge processors for the system. And then Hewlett Packard Enterprise to integrate those processors into the actual Frontier supercomputer. We defined with them along the way an investment strategy that allowed us to work with these partners to incubate key technologies that we knew we would need for Frontier.

MORGAN: Planning and progress for Frontier was moving as scheduled until 2020, when a global pandemic added a whole other layer of complexity to the build. 

WHITT: It’s never been easy doing things at this scale. Each one we've done, it gets a little bit harder and a little bit harder. But if there's a lesson learned from all this, it's: if you can avoid it, don't deploy a supercomputer in a pandemic.

JENNY: ORNL and its private sector partners were not immune to the supply chain issues we’ve all experienced during the pandemic.  

WHITT: Going back to last summer, that's when we were expecting to start getting Frontier hardware. But that's also when the chip shortages and electronic component shortages were just starting to hit. You were reading about them in the newspaper; it was the same kind of shortages that were affecting the auto industry. You know, first estimates were a year-and-a-half to two-year delay for Frontier. And we were like, “Um, you're kidding, we've come all this way.” And we're here at the end, we're ready to get the components and start putting them together. We were just dumbfounded. It wasn’t even the chips necessarily. It was a lot of really common components. And it was the kind of things you would need to put any kind of silicon on any kind of compute board. So, voltage regulators, for instance, were a problem. You know, there were 60 million parts needed for Frontier, over maybe a thousand part numbers, an incredible amount of moving components here. And HPE and AMD, at one time, they had 15 people that were dedicated; that's all they were doing, trying to find parts to build Frontier. For months at a time they were trying to find parts, they were calling warehouses, they were calling their competitors to see if they could make a deal. They would say, “Well, we have this component that you don’t have. Do you have this component that we need?” And this was changing on an hourly basis. Because the supply chain was so volatile at that point, they would find parts, and those parts would disappear before they could get them.

MORGAN: Despite months of uncertainty with sourcing the parts, it only set the project’s timeline back three months. 

JENNY: Delivery of the Frontier cabinets began in August 2021 and was complete by October. 

[MUSIC TRANSITION] 

MORGAN: While infrastructure upgrades were being made and computer parts were being sourced, the OLCF team was also working to prepare for Frontier’s ultimate mission – to deliver world-leading science.  

JENNY: OLCF is a U.S. Department of Energy Office of Science user facility, which means anyone can apply for time on the lab’s supercomputers.  

MORGAN: This is an incredibly competitive process, but it’s also free to those who are awarded time as long as they publish their results. Here’s Bronson Messer, OLCF’s director for science. 

BRONSON MESSER: One of the great things and one of the curses is that supercomputing power is still in great demand. Year after year after year, for our primary allocation programs, we have far, far too many requests to be able to grant them all, requests that are asking for far, far more time than we have on the machine. Essentially, every award represents a very large financial investment, but that time is provided free as long as the knowledge that's gleaned from those scientific campaigns is published in the open literature. Ultimately, that's what we want to do. We want to push out scientific results as effectively as possible.

JENNY: To get ready for Frontier, OLCF worked with researchers to prepare codes and software to run on the new machine. 

MORGAN: Like its two predecessors, Frontier relies on a combination of graphics processing units, or GPUs, which most consumers know as graphics cards for video gaming, and CPUs. The addition of GPUs in high-performance computers, beginning with Titan in 2012, was a game-changer for the energy efficiency of supercomputers due to their low power consumption.

JENNY: But ORNL also recognized that researchers would need some help in figuring out how to use the new platform. One way they did this is through OLCF’s Center for Accelerated Application Readiness, or CAAR.

MESSER: So, when Titan first came on the scene, we were the first facility to actually do hybrid node computing, GPU plus CPU. We knew that that was going to be a bit of a lift for people. Programming GPUs is… is not the most straightforward thing if you've only ever done it on CPUs before. GPUs are really, really fast, really, really low power and abysmally stupid. So, you have to tell them what to do very specifically. We did it for Titan. We did it for Summit. We're doing it again for Frontier. So we picked eight codes for CAAR, and all those teams are in pretty good shape to be able to take advantage of Frontier on day one. They're chomping at the bit.

JENNY: These projects run the gamut from fundamental nuclear physics to groundwater flow to molecular dynamics.

MORGAN: There’s one project that Bronson, who is an astrophysicist by training, is particularly excited about. 

MESSER: And one really cool project is using a code called Cholla. The PI is Evan Schneider at the University of Pittsburgh. And she and her team really want to model a Milky Way-like galaxy and look at the galactic outflow from that galaxy to tell us something about the seeds of creation.

JENNY: Evan Schneider is an assistant professor in the Department of Physics and Astronomy at the University of Pittsburgh. She developed the Cholla code in 2012 for the Titan supercomputer to simulate the evolution of the galaxy, something you really couldn’t observe without a supercomputer.

EVAN SCHNEIDER: One restriction on observations that we get in astronomy is that you're only ever getting a sort of snapshot of what's going on. Right. So things evolve in the universe, on cosmically long timescales. And so you want to understand how a galaxy like our Milky Way formed. Observations of galaxies, like our Milky Way are helpful, right? But you can't watch that evolution happening in real time. All you can do is go get a bunch of images of galaxies that we think look like the Milky Way, kind of try to stack them together over time and back out the process. So rather than just having snapshots of that galaxy over time, right, we'll put all of the physics that we need into our simulation software, set up something that looks like a Milky Way Galaxy, and then press play, and let it run, and just watch what happens. 

MORGAN: After running on Titan, the visualization the code produced gave Evan and her team valuable insights into the mysterious realm of galactic winds and how they affect the formation of the galaxy. 

JENNY: Running the code on Frontier will enable even more insights like this because the visualizations it will produce will have greater detail – or resolution – than they did on Titan. 

SCHNEIDER: One of the limiting factors is the resolution that we can get. So, the way that these models work is basically you can imagine taking some three-dimensional volume of space, maybe the area of space that contains the galaxy that we're interested in, and then dividing it up into individual cells. Each of these cells can be, say, a little cube. And so you’ve got one big cube that’s divided up into many, many, many smaller cubes. And then we essentially solve differential equations that tell us how the properties of the matter in each of these little cells change over time. And that is, at the end of the day, all that these, you know, models are doing. Right. Taking some region of space, we’re dividing it up. We’re trying to understand the properties of the matter in that space and change it over time according to these equations that are based on physical laws. And the reason that we need really big computers to do this well is that you can imagine exactly the same way that if you have a really low-resolution image, if you can add more pixels to that image, it gets clearer.
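The cell-by-cell approach described above can be illustrated with a toy example. This is a minimal sketch, not the Cholla code or its numerical method: it uses a one-dimensional periodic grid rather than a 3D cube, and a first-order upwind scheme for a simple advection equation as a stand-in for the real physics. The names `cell_count` and `advect_step` are hypothetical.

```python
# Toy grid simulation: each cell holds a density value that is updated
# from its neighbor at every time step. This is NOT Cholla's method;
# it is a first-order upwind scheme for 1D advection on a periodic grid.


def cell_count(cells_per_side):
    """Number of small cubes when a cube is split cells_per_side times along each axis."""
    return cells_per_side ** 3


def advect_step(density, courant):
    """Advance the grid by one time step.

    courant = velocity * dt / dx, for a positive velocity. Values between
    0 and 1 keep this scheme stable. density[i - 1] wraps around at i == 0,
    which makes the grid periodic.
    """
    if not 0.0 <= courant <= 1.0:
        raise ValueError("courant number must be between 0 and 1")
    return [
        density[i] - courant * (density[i] - density[i - 1])
        for i in range(len(density))
    ]


# Doubling the resolution along each axis multiplies the cell count by 8.
print(cell_count(512), cell_count(1024))

# A single blob of matter moving to the right across a four-cell grid.
grid = [0.0, 0.0, 1.0, 0.0]
grid = advect_step(grid, 0.5)
print(grid)  # [0.0, 0.0, 0.5, 0.5]
```

The cell count is why resolution is expensive: every doubling of detail along each axis means eight times as many cells to update at every time step.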

MORGAN: The visualizations produced from Cholla are not only rich with data, but they’re also visually stunning.  

SCHNEIDER: I mean, I'm absolutely thrilled, right, this is literally something I've been working towards for 10 years. And I think it's going to look stunning. As scientists, sometimes we get really caught up in our very scientific looking plots, with your X and your Y axes. And those are great. We do a lot of amazing science with them. But I think the power of just sort of stunning images cannot be understated.

[MUSIC TRANSITION] 

JENNY: In the supercomputing world, the Top 500 list is the definitive list of computer rankings. 

MORGAN: The rankings are released twice a year, and organizations around the world eagerly wait to see if their machine has clinched the coveted No. 1 spot.

JENNY: ORNL’s Jaguar, Titan and Summit supercomputers all debuted at the top of the list. Justin Whitt and his team wanted Frontier to be no different.  

MORGAN: After the machine’s hard-won parts were delivered in fall of 2021, teams worked round the clock to get the massive computer up and running. All while the spring deadline for the Top 500 submission loomed in the background.  

JENNY: No pressure, right? 

WHITT: We were working around the clock to find that last bit of performance that got us over the hump. And we had, you know, engineers distributed all over the country that were up all night, running these codes, watching the output of them, and making adjustments over and over again.

MORGAN: A computer’s ranking is determined by running a code on it called the High-Performance LINPACK benchmark.

JENNY: The LINPACK measures the machine’s floating-point compute power, or the number of times a calculation can be done per second. 

MORGAN: By the way, floating-point operations per second, or flops, is how supercomputing speed is measured. Each floating-point operation is a single calculation, such as an addition, subtraction, multiplication or division.
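The idea of a flops measurement can be sketched in a few lines: count the floating-point operations a calculation performs and divide by the time it took. The example below is not the High-Performance LINPACK code. It is a small, single-threaded illustration that solves a dense system of linear equations (the same kind of problem the benchmark is built around) using Gaussian elimination with partial pivoting. The function names and the test matrix are hypothetical, and a singular matrix is not handled.

```python
import time


def solve_and_count(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Returns (x, flop_count), counting each add, subtract, multiply
    or divide on matrix and vector entries as one operation.
    """
    n = len(b)
    a = [row[:] for row in a]  # work on copies so inputs stay unchanged
    b = b[:]
    flops = 0
    for k in range(n):
        # Swap in the row with the largest entry in column k.
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            flops += 1
            for j in range(k + 1, n):
                a[i][j] -= factor * a[k][j]
                flops += 2
            b[i] -= factor * b[k]
            flops += 2
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = b[i]
        for j in range(i + 1, n):
            s -= a[i][j] * x[j]
            flops += 2
        x[i] = s / a[i][i]
        flops += 1
    return x, flops


def flops_rate(flop_count, seconds):
    """Floating-point operations per second for a timed run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return flop_count / seconds


# Time a small, made-up, diagonally dominant system.
n = 60
matrix = [[n + 1.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
rhs = [1.0] * n
start = time.perf_counter()
solution, operations = solve_and_count(matrix, rhs)
elapsed = time.perf_counter() - start
print(f"{operations} operations, {flops_rate(operations, elapsed):.3e} flops")
```

Pure Python on one core will report a rate many orders of magnitude below a supercomputer's; the point is only the bookkeeping of operations divided by time.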

JENNY: The team was getting impressive petaflop measures from the machine, but they had yet to cross the exascale barrier to see an exaflop number.  

MORGAN: But they kept running the program – up to the very last minute – to see if they could make exascale magic happen in time for the list. 

WHITT: It really did. It came down to the wire, it came down within hours of the deadline that we finally had a run that went through that topped the exaflop barrier on Frontier. That was just a tremendous moment, it was about 6 a.m. in the morning. We'd had several jobs that died. And we're like down to the wire and the jobs are dying. And everyone's just on the edge of their seat watching this. I was in my living room about 3 a.m. that morning, and we were watching the power profiles, and the power profile would go up and it'd be going in. So you know, the system just plumbing along. We're using 25 megawatts of power, like city of Oak Ridge kind of power, right? And then all of a sudden, something would go wrong, and the power would drop and your heart would just drop, you know, and you're like, “No!” and then it would come back up again. You're like, “Yay, it’s up. It's gonna survive!”

JENNY: The hard work and sleepless nights paid off when they finally saw the number they’d all been waiting for – 1.1 exaflops, making Frontier officially the first exascale supercomputer in the world, and the world’s fastest supercomputer.

WHITT: And we were watching these runs and finally one went through and it was just pure elation. I mean, the chat channels, because, you know, everyone's at home, up all night, they just erupted with cheers and virtual high fives. And people were so excited because they knew that we'd really done something that had never been done before and really made history by being the first computer to break the exaflop barrier.

MORGAN: It was a proud moment that the team and ORNL Director Thomas Zacharia won’t soon forget.  
 
JENNY: Thomas accepted Frontier’s No. 1 ranking on the Top 500 list on May 31 in Hamburg, Germany. 

THOMAS ZACHARIA: You know, when I was in Hamburg, at the Top 500 event, accepting the recognition that Frontier is the first true exascale system that the world has seen, I talked about my experience. And as I was, you know, thinking about what I was going to say, it hit me that I had the privilege of leading and helping with the deployment of 10 supercomputers. And the last four of them debuted as number one in the world. That is an awesome, awesome feeling. A proud moment, proud moment, proud for all our staff at the laboratory, not just in computing, but it takes everyone, the Facilities and Operations Directorate staff, engineers in alignment, there's just a lot that has to happen behind the scenes in order to deliver a system like this. But it's also a proud moment for the community.

MORGAN: Long before becoming director of the lab, Thomas was the driving force behind ORNL’s leadership in supercomputing. Early on, he recognized the need for high-performance computing to solve some of the world’s most daunting problems. 

ZACHARIA: Most of my work in terms of supercomputing, most of the work that I had the privilege of doing, was on the Intel Paragon, which was a 150 gigaflops machine. And I’ve said this before, if you have an iPhone 6 or better, you have a supercomputer in your hand that is more powerful than the Intel Paragon. And there are about 2 billion plus such devices globally. So, each successive supercomputer represents a tremendous advance. In my career, as I said, I have deployed 10 supercomputers. Each of them represents a journey in time that allows us, a select few privileged scientists and engineers, to use this machine to… to investigate the art of the possible, because we know that these machines, leading edge machines, give you essentially a 20-year window to go fast forward in time to understand what is the art of the possible, and then shape how society and technology evolve, knowing that Frontier class systems will be in your hands, in your pockets 20 years from now. That is a tremendous advantage.

JENNY: Frontier boasts unparalleled artificial intelligence and machine learning capabilities. This will help scientists tackle problems like climate change in ways no other machine has been able to do before. 

ZACHARIA: Now, we are now taking seriously energy transition and climate change as some of the compelling challenges that we as humanity face. And in the 2000 to 2004 timeframe, Japan invested close to a billion dollars to build Earth Simulator, which had just under 40 teraflops capability. It was designed to do climate research. One cabinet of Frontier is 635 times more powerful than Earth Simulator. So, imagine that kind of capability. Imagine the kind of resource that we are placing in the hands of this new generation of scientists, so that they can pursue their life's work in helping society and humanity tackle these really compelling challenges in terms of how we drive to a Net Zero world.

MORGAN: Frontier earned another accolade during its debut – it also topped the list ranking energy efficient supercomputers. 

ZACHARIA: So, Frontier is the first supercomputer of its size and scale that's also among the top in the Green 500, which ranks the most energy-efficient machines. So, obviously, one cabinet of Frontier is No. 1, but the entire Frontier system is just closely behind it at No. 2 on the Green 500. That means going from one cabinet to the full size of Frontier, we did not lose a lot of efficiency. That is remarkable. And so, Frontier is already beginning to reveal how the future is going to evolve, because the technology that was developed to build Frontier is going to be deployed by AMD and HPE globally. And because it's so energy efficient, it is going to be much more of a pervasive technology because we can deploy it efficiently.

JENNY: Frontier rounded out the twice-yearly rankings with the top spot in a newer category, mixed-precision computing, that rates performance in formats commonly used for artificial intelligence, with a performance of 6.88 exaflops. This makes it the most powerful supercomputer for machine learning ever built.  

MORGAN: While Frontier has officially made its debut, the work isn’t done for the team at OLCF. They will continue to run tests on the machine and allow early access scientists like Evan Schneider to start running their codes later this year. Their full scientific user program will start at the beginning of 2023. 

[MUSIC TRANSITION] 

JENNY: Frontier’s arrival has made history and we wanted to know what it feels like to be part of that moment. 

MORGAN: Here’s Justin Whitt again.  

WHITT: It's inspiring, you know, to think about the hundreds of people that have worked so hard to make this a reality. Scientists and engineers, craftsmen and various experts that span the Department of Energy and academia and our industry partners. At the same time, it's incredibly exciting to think about what researchers across scientific and engineering fields will accomplish when we put this new, incredibly powerful tool into their hands. 

JENNY: And for Thomas Zacharia, this historic milestone is meaningful on a personal level. After launching 10 supercomputers over his 35-year career at the lab, Frontier will officially be his last, as he is set to retire at the end of the year. 

ZACHARIA: Of course, I realize that I've had a very long and productive career at the laboratory. Thirty-five years is a long time, 35 years and five years as the lab director. And, you know, I think each one of these moments is special. I think they're different. But they're special. I would not put one more or less than the other. But that said, Frontier is pretty special. As someone said, globally, people will remember the machines and the institutions that crossed a threshold. Yes, Jaguar was a number one machine. And it was number one on Top 500. It was our first petascale machine for the Office of Science. But the first petascale machine that crossed the barrier was Roadrunner at Los Alamos. People will remember that forever. And in that regard, people will remember Frontier and Oak Ridge National Laboratory forever, because it truly is the first machine that crossed the exascale barrier. That's a pretty big deal.

[MUSIC TRANSITION] 

JENNY: Thank you for listening to this episode of The Sound of Science. 

MORGAN: We hope you enjoyed this episode. Please leave us a review wherever you get your podcasts.

JENNY: Be sure to follow ORNL on social media so you don’t miss any of the amazing science that will be coming out of Frontier.  

MORGAN: Until next time!