Programming Titan

Hybrid architecture points the way to the future of supercomputing

Titan will be capable of performing 20,000 trillion calculations every second, making it six times more powerful than ORNL's current Jaguar system. Image: Andy Sproles

When the US Department of Energy asked researchers across a range of disciplines what they would do with a thousand-fold increase in computing power, ORNL computational scientist Jim Hack recalls, they all had the same need: the ability to create more detailed simulations.

"Increasing computational resources provides scientists with the opportunity to simulate whatever phenomenon or problem they're looking at with considerably more fidelity," says Hack, who directs the laboratory's National Center for Computational Sciences. "This increased accuracy can come in the form of greater resolution, or greater complexity. For example, when combustion researchers investigate how to burn different types of fuel more efficiently, they rely on computer models that involve a lot of complex chemistry and turbulent motion of gases. The more computational power they have, the more realistic the simulation will be."

Fundamental changes

This need for increased processing power is behind a fundamental change in supercomputer design that's taking place at ORNL and other leading centers of computational research.

"We have ridden the current architecture about as far as we can," Hack says. "Our Jaguar computer has 300,000 processor cores in 200 cabinets connected by a high-performance network. If we wanted to create a machine that is 10 times more powerful than Jaguar using the same architecture, we would have to put 2,000 cabinets on the floor, consume 10 times the power, et cetera."

The solution to this dilemma, at least for the NCCS, has been to design a machine that's different from Jaguar in two important ways. The first step in Jaguar's transformation was to boost the performance of the computer's central processing units by increasing the number of cores per node from 12 to 16.

"Over the last few years, especially, CPUs have gotten faster because vendors have incorporated more processing cores," Hack notes. "The processing cores aren't getting any faster, so the whole strategy for improving performance in the same footprint has been to add more processing cores. However, we're running out of room to go much further in that direction."

The remedy for this space constraint is the second part of the upgrade solution being put into practice in the new Titan supercomputer. In addition to amped-up CPUs, Titan will incorporate a feature found in many PCs—graphics processing units. On a laptop, GPUs are used to speed up video rendering, particularly for computer gaming apps. Unlike CPUs, which orchestrate the flow of information among various parts of a computer program, GPUs have a relatively simple task: run the computationally intensive parts of the program really, really fast.

New, portable code

Despite these upgrade modifications, when Titan is deployed, the basic footprint for the machine will be identical to Jaguar's. But instead of two 6-core processors, each of the 18,000 or so compute nodes will have a 16-core CPU paired with a GPU. The GPU has the potential to provide 10 times the performance of the CPU.

"The GPU is not as sophisticated as the CPU," Hack explains. "It waits for the CPU to say, 'I want you to do this to that data stream.' Then the data flows through the processor and an answer comes out the other side. This happens in thousands of parallel streams, so we can make better use of data locality—rather than moving data from node to node to node. With the right programming, GPUs work through the data in a very structured way."

In addition to processor-related improvements, Titan will employ an upgraded "interconnect," the internal network that enables compute nodes to talk to one another. The new network can handle a much higher volume of data and has a greater capacity to quickly re-route data when a data path is unexpectedly occupied or otherwise unavailable.

The linchpins holding these hardware improvements together are popular computer codes retooled by ORNL to take full advantage of Titan's rigorously parallel architecture. Much of this work is being done by the laboratory's Center for Accelerated Application Readiness. CAAR includes not only NCCS staff but also experts from various scientific disciplines and hardware and software vendors.

"When Titan comes online this fall, we expect it to offer researchers a full-fledged computing environment," Hack says. "Creating this environment has involved working with a wide range of software technologies and wasn't accomplished overnight. But it's easier now that we have better tools. Our collaboration with software vendors has helped us to evolve tools that make working on Titan easier for the application programmers."

CAAR's initial goal was to facilitate research in astrophysics, biology, bioenergy, chemistry, combustion and energy storage by identifying widely used application codes in these areas. When the group analyzed machine usage, they found that the underlying algorithms in six major codes accounted for a significant fraction of the total at the time. So those codes were the first to be migrated from the current architecture to the new Titan architecture.

"The CAAR group is showing the way," Hack says. "They're the trailblazers who are demonstrating not only how to optimize code for Titan, but also how to do it in a way that's performance-portable—meaning that when you're done writing your code you can run it on any machine, regardless of its architecture. At the same time, they are establishing best practices for other programmers to follow when migrating their own codes."

A step toward exascale

In addition to being a world-class scientific computing resource, Titan is a testbed for concepts that its designers hope will move computer technology toward the goal of "exascale" computing—about 50 to 100 times faster than Titan's expected top speed.

"Our main concerns in this area included limiting both Titan's footprint and its power consumption," Hack says. "It's important to note that a cabinet costs about the same to populate and operate whether it's on Jaguar or Titan. With the hybrid architecture, for about the same price and energy consumption, we will get about 10 times the performance out of Titan. This is a significant step toward making exascale computing practical and affordable."

Of course, the boost in performance was achieved largely through the incorporation of GPU technology into Titan's compute nodes. The GPU's advantage comes from its simplicity and raw throughput. The price for this computational advantage is the additional responsibility placed on programmers and application scientists to figure out how to structure their data so they can provide the processor with input that streams through it seamlessly.

CAAR is also working to provide programmers with the tools they'll need to pass information efficiently to and from the GPUs. Historically, GPU software has been somewhat proprietary in the sense that each of the video accelerator manufacturers has had its own instruction set that tells the GPU what to do. Titan's GPUs are manufactured by NVIDIA and use an instruction language called CUDA (Compute Unified Device Architecture).

"The need to program the GPU using CUDA presented a problem to our programmers, who were already a couple million lines deep in code written in FORTRAN or C++," Hack explains. "It wouldn't be practical for them to rewrite that code in CUDA. So our strategy was to prepare the community to migrate to this kind of architecture by working with NVIDIA and other hardware and software vendors to establish a common set of directives that allow software written in common scientific computing programming languages to tell the GPU what set of CUDA operations it needs to perform on the data."

"Code that has been re-factored in this way to run efficiently on Titan's GPUs has the added advantage of running faster on other processors that can take advantage of the fine-grained parallelism in the code," Hack says. "My message is that you can't lose with this kind of investment—the code allows any processor to take advantage of opportunities for concurrency and helps it to avoid moving data around unnecessarily."

Solvable problems

Hack observes that if a scientist wants to tackle a problem with a computer simulation that would take a year's worth of computation to solve, then for practical purposes, it's an intractable problem. Which problems are solvable is often a function of how much computational power the researcher has access to, so Titan's ability to bring greater power to bear on any given challenge automatically expands the realm of solvable problems.

"As technology gets better, we are able to tackle much more complex problems," Hack says. "More powerful machines provide scientists with the opportunity to get high-fidelity answers. The opportunity to use Titan will enable researchers in a range of disciplines to build simulations that address important scientific questions faster, with more detail and with greater accuracy."
