Verónica Melesse Vergara: Troubleshooting the world’s smartest and fastest supercomputer

Topic: Supercomputing

The unique process of accepting a new supercomputer is one of the most challenging projects a programmer may take on during a career. When the Oak Ridge Leadership Computing Facility’s (OLCF’s) Verónica Melesse Vergara came to the United States from Ecuador in 2005, she never would have dreamed of being part of such an endeavor. But just last fall, she was.

Melesse Vergara remembers well her first class on scientific computing at Reed College and her thoughts after learning about the world of modeling and simulation through a project involving simulations of crystal formation.

“I thought, ‘Oh, that’s really interesting,’ but at the time I didn’t think there would be careers in that,” Melesse Vergara said. “My plan was to become a software engineer somewhere.”

Originally from Quito, Ecuador, Melesse Vergara came to the US as a mathematics and physics major, earning a bachelor of arts from Reed College in Portland, Oregon, in 2008. Having previously attended Universidad San Francisco de Quito, a school with nearly five times as many students as Reed, Melesse Vergara enjoyed the smaller class sizes and the emphasis on individualized learning.

“I enrolled at Reed as part of a foreign exchange program,” Melesse Vergara said. “I was only supposed to be there for a year, but I applied to transfer and completed my degree there instead.”

She subsequently pursued a master of science in computational science at Florida State University. The field seemed a natural fit for Melesse Vergara, who sought a practical way to combine her long-held knack for problem-solving and her love of programming, which had grown out of her family’s involvement in computing during her childhood.

“My dad worked for a consulting company and used to build computers for the company’s clients,” Melesse Vergara said. “People would specify what they wanted, and my dad would buy the pieces, put them together, and sell the computer.”

First, Melesse Vergara learned how to build websites and games with her brother. Then in middle school, she began learning programming languages, starting with C and C++. When she found physics and mathematics, her interest in scientific computing began to unfold.

“What really drew me to computing was that you could imagine something and create it from scratch,” Melesse Vergara said.

Her breakthrough came in graduate school, when she applied to be a student volunteer at the 2010 supercomputing conference SC10. After seeing the hundreds of high-performance computing (HPC) companies and organizations in attendance, she realized pursuing a technical career in the field was plausible.

After earning her master’s degree in 2011, she began her supercomputing career with a job that involved troubleshooting and debugging scientific applications on large cluster resources at Purdue University. In 2014, after volunteering at SC13 to help OLCF HPC storage systems engineer Dustin Leverman with the conference’s student cluster challenge, she joined ORNL.

Now, Melesse Vergara is troubleshooting one of the largest supercomputing systems in the world.

No “I” in team

As an HPC engineer at the OLCF, Melesse Vergara helped lead the acceptance process for the world’s most powerful supercomputer—the OLCF’s 200-petaflop IBM AC922 Summit—last fall. The OLCF is a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL).

As part of the System Test Working Group, she coordinated and organized test development and benchmarking of the massive architecture for ORNL. Specifically, she led the design and selection of test codes and problem cases to ensure the stability of the system.

“Our group dealt with all kinds of issues,” Melesse Vergara said. “We tried to make sure that every issue we found, whether it was with the compilers or other aspects of the system, was resolved before Summit went live.” Compilers translate source code into instructions a computer can execute, so compiler bugs must be resolved for applications to run correctly and efficiently.

The Phase 1 acceptance test plan consisted of more than 380 tests, with more than 7,800 jobs running over a continuous 92-hour period during stability testing. Phase 2 of acceptance, which covered all of Summit’s compute nodes, comprised more than 600 tests, with close to 30,000 jobs running over a 336-hour period during stability testing.
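To put those stability-testing figures in perspective, the short sketch below works out the average job rate for each phase. The inputs are the rounded numbers quoted above ("more than" and "close to"), so the results are rough estimates only; the function name `jobs_per_hour` is chosen here for illustration.

```python
# Rounded figures quoted in the article; the real counts are approximate.
phases = {
    "Phase 1": {"jobs": 7_800, "hours": 92},
    "Phase 2": {"jobs": 30_000, "hours": 336},
}


def jobs_per_hour(jobs: int, hours: int) -> float:
    """Average number of jobs run per hour of stability testing."""
    return jobs / hours


for name, p in phases.items():
    rate = jobs_per_hour(p["jobs"], p["hours"])
    days = p["hours"] / 24
    print(f"{name}: about {rate:.0f} jobs per hour over {days:.1f} days")
```

By this estimate, both phases averaged somewhere between 80 and 90 jobs per hour, with Phase 2 sustaining that pace for 14 days.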

“During the Summit acceptance process, we ran benchmarks and full applications at different problem sizes to verify correctness while also evaluating performance,” Melesse Vergara said. “Then, we looked at what happened when we put as much load on the system as we could to test its stability.”
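One way to picture the approach she describes, checking correctness while measuring performance across problem sizes, is the small sketch below. It is a hypothetical illustration and not the OLCF's test harness: `kernel_under_test`, `reference`, `run_acceptance_case`, and `run_suite` are made-up names, and a simple sum of squares stands in for a real scientific application.

```python
import math
import time


def kernel_under_test(n: int) -> float:
    """Hypothetical stand-in for an application: sum of the first n squares."""
    return float(sum(i * i for i in range(1, n + 1)))


def reference(n: int) -> float:
    """Closed-form answer used to check the kernel's result."""
    return n * (n + 1) * (2 * n + 1) / 6


def run_acceptance_case(n: int, rel_tol: float = 1e-12) -> dict:
    """Run one problem size, recording correctness and wall-clock time."""
    start = time.perf_counter()
    result = kernel_under_test(n)
    elapsed = time.perf_counter() - start
    passed = math.isclose(result, reference(n), rel_tol=rel_tol)
    return {"n": n, "passed": passed, "seconds": elapsed}


def run_suite(sizes):
    """Run the same test at several problem sizes."""
    return [run_acceptance_case(n) for n in sizes]


if __name__ == "__main__":
    for case in run_suite([1_000, 10_000, 100_000]):
        status = "PASS" if case["passed"] else "FAIL"
        print(f"n={case['n']:>7}  {status}  {case['seconds']:.6f} s")
```

A real acceptance suite would replace the stand-in kernel with benchmarks and full applications, and would add a separate stress phase that keeps the machine fully loaded for an extended period.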

Even before Summit’s official acceptance, teams at the OLCF were running applications at exascale levels using what are called mixed-precision calculations. In November 2018, the OLCF successfully completed acceptance for the final Summit system.
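For readers unfamiliar with the term, mixed-precision calculations generally perform some arithmetic in a lower-precision number format and keep the rest in a higher-precision one. The toy sketch below uses only Python's standard library and is an illustration, not the method the OLCF teams used; the helper names `to_half`, `dot_all_half`, and `dot_mixed` are hypothetical. It rounds inputs to 16-bit half precision, then compares keeping the running sum in half precision against keeping it in Python's 64-bit floats.

```python
import struct


def to_half(x: float) -> float:
    """Round a 64-bit Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_all_half(a, b) -> float:
    """Dot product with inputs, products, and the running sum all in half precision."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_half(to_half(x) * to_half(y))
        total = to_half(total + product)
    return total


def dot_mixed(a, b) -> float:
    """Dot product with half-precision inputs but a 64-bit running sum."""
    total = 0.0
    for x, y in zip(a, b):
        total += to_half(x) * to_half(y)
    return total


if __name__ == "__main__":
    a = [0.1] * 10_000
    b = [1.0] * 10_000
    print("all half precision:", dot_all_half(a, b))
    print("mixed precision:   ", dot_mixed(a, b))
```

With 10,000 terms of 0.1 × 1.0, the all-half sum stalls far below the expected value of about 1,000, because small increments are rounded away once the running total grows. The mixed version stays close to 1,000 while still storing its inputs in the smaller format.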

Although Melesse Vergara’s troubleshooting skills extend back to her days at Purdue University, where she worked as a scientific applications analyst for 3 years, the size of HPC resources such as Summit makes her work especially challenging.

“Part of my job has been doing validation testing and making sure all the tests that have been developed will give us correct results,” Melesse Vergara said. “We had to wait for Summit’s hardware to arrive before we could determine if there were major problems that had to be addressed before acceptance could start. It was a complex process and required us to work closely with IBM development teams.”

The acceptance team consisted of 36 people from multiple groups at the National Center for Computational Sciences at ORNL, including staff members in the OLCF’s HPC Core Operations, HPC and Data Operations, User Assistance and Outreach, Scientific Computing, Technology Integration, and Computer Science Research Groups.

A blank slate

Melesse Vergara has worked with new architectures in the past, but before the Summit project, she had not dealt with prerelease technologies.

She said the most exciting aspect of working on a project such as Summit is the ability to run on a novel architecture with a specialized software stack, such as the one designed specifically for the Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (CORAL). The goal of CORAL is to stand up leadership computers for these sites that will outperform current DOE leadership systems by 5 to 10 times.

After the HPC and Data Operations Group sets up and configures new supercomputing systems, Melesse Vergara’s team is among the initial groups to run scientific applications and codes on the system, troubleshooting bugs and problems along the way.

“Summit’s compute nodes were not generally available at the time, and that was both exciting and challenging,” Melesse Vergara said. “It’s not just an architecture that had never been in production before. It was deployed at a scale that hadn’t been tested before—and we got to make it work.”

Now, Melesse Vergara is focused on troubleshooting problems and helping users get the best out of Summit. At the end of the day, she finds purpose in ensuring researchers can work on some of the world’s most challenging scientific problems.

“I like knowing that we contributed to building an instrument of this scale that can help scientists discover things that they wouldn’t have been able to discover without it,” Melesse Vergara said. “To know that we are helping enable science is a really good feeling.”

ORNL is managed by UT-Battelle LLC for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://science.energy.gov.