System administrators and managers of terascale computer centers are facing a crisis. The nation's premier scientific computing centers all use incompatible, ad hoc sets of systems tools that were not designed to scale to the multiteraflop systems being installed in supercomputer centers today. One solution would be for each computer center to rewrite its homegrown software to be scalable. But this approach would incur a tremendous duplication of effort and delay the availability of terascale computers for scientific discovery.
The purpose of the Scalable Systems Software project is to provide a much more timely and cost-effective solution by pulling together representatives from the major computer centers and industry and having them collectively define standardized interfaces between system components. At the same time this group can produce a fully integrated suite of systems software components that can be used by the nation's largest scientific computing centers.
The scalable systems software suite is being designed to support computers that scale to very large physical sizes without requiring that the number of support staff scale along with the machine. This strategy goes beyond just creating a collection of separate scalable components. By defining a software architecture and interfaces between system components, the Scalable Systems Software research is creating an interoperable framework for the components.
Systems interfaces are being standardized using a process similar to the one that successfully defined the Message Passing Interface (MPI) standard. This process is an open forum of university, lab, and industry representatives who meet regularly to propose and vote on pieces of the standard. The figure at the bottom of this page represents the significant progress to date on producing scalable components and defining standardized interfaces between them. The bold lines represent working interfaces. The light lines represent interfaces in progress. The color of each component indicates which of the four multi-lab working groups inside the project is responsible for it.
In November 2003 the first release of a complete, integrated set of scalable systems components was made. This distribution used the popular OSCAR packaging and install technology. A second release is scheduled for March 2004. This past year the system administrators at Argonne National Laboratory decided to switch their "Chiba City" cluster to use our scalable systems suite exclusively. In January 2004 the suite underwent scale tests on the 5,160-processor Titanium cluster at the National Center for Supercomputing Applications. Our research has developed software that provides a communication service between components over multiple protocols, as well as a flexible authentication scheme that secures the overall system. Research continues to harden the working prototypes, improve integration, and increase scalability to the target of 10,000-processor systems.
The coordinator for this project is Al Geist, an ORNL corporate fellow. The participating organizations include seven Department of Energy laboratories, three National Science Foundation supercomputer centers, and five supercomputer vendors. The DOE labs are ORNL, ANL, Ames, Lawrence Berkeley, Los Alamos, Pacific Northwest, and Sandia national laboratories. The NSF sites are the NCSA, the Pittsburgh Supercomputing Center, and the San Diego Supercomputer Center. The vendors are IBM, Silicon Graphics, Cray, Hewlett-Packard, and Intel.
What is the impact of this project? The Scalable Systems Software project is a catalyst for fundamentally changing the way future high-end systems software is developed and distributed. It will reduce facility management costs by reducing the need to support homegrown software, making higher-quality systems tools available, and making it possible to get new machines up and running faster and keep them running. The project will also help scientific applications use machines more effectively by providing scalable job launch, standardized job monitoring and management software, and allocation tools for the cost-effective management and utilization of terascale computer resources.