The Scalable Systems Software project coordinated by ORNL is fundamentally
changing the way future high-end systems software is developed
to make it more cost effective, robust, and scalable to multi-teraflop
supercomputers.

Stephen Shevlin (foreground) and Tom Dunigan (standing) discuss with
Pratul Agarwal an image on the computer monitor that resulted
from a simulation of a protein. The IBM Power4 (Cheetah) supercomputer
at CCS was used to perform multi-scale modeling of vibrations
in the protein cyclophilin A, which is related to HIV infections.
System administrators
and managers of terascale computer centers are facing a crisis. The
nation's premier scientific computing centers all use incompatible,
ad hoc sets of systems tools that were not designed to scale to the
multiteraflop systems being installed in supercomputer centers today.
One solution would be for each computer center to take its homegrown
software and rewrite it to be scalable. But this approach would incur
a tremendous duplication of effort and delay the availability of terascale
computers for scientific discovery.
The purpose
of the Scalable Systems Software project is to provide a much more
timely and cost-effective solution by pulling together representatives
from the major computer centers and
industry and having them collectively
define standardized interfaces between
system components. At the same time this group can produce a fully
integrated suite of systems software components that can be used by
the nation's largest scientific computing centers.
The scalable systems software suite
is being designed to support computers that scale to very large physical sizes
without requiring that the number of support staff scale along with
the machine. This strategy goes beyond just creating a collection of separate scalable
components. By defining a software architecture and interfaces between system
components, the Scalable Systems Software research is creating an
interoperable framework for the components.

One of our top award winners associated with software tool development
at CCS is Jack Dongarra, who directs UT's Innovative Computing
Laboratory. He recently won two R&D 100 Awards, was elected a
member of the National Academy of Engineering, and earned a Fernbach
Award. He helps compile the Top 500 list of supercomputers, which ranks
systems by measured performance on the LINPACK benchmark.
This interoperable framework makes it much easier and more cost-effective for supercomputer centers
to adapt, update, and maintain the components in order to keep up with new
hardware and software. A well-defined interface allows a site to
replace or customize individual components as needed. Defining the interfaces between
components across the entire system software architecture provides
an integrating force between the system components as a whole and improves the
long-term usability and manageability of terascale systems at supercomputer centers
across the country.
Systems interfaces are being standardized
using a process similar to the one
employed to successfully define the Message
Passing Interface (MPI) standard. This process
is an open forum of university, lab, and
industry representatives who meet regularly
to propose and vote on pieces of the
standard. The figure at the bottom of this page represents
the significant progress to date on
producing scalable components and defining
standardized interfaces between
them. The bold lines represent working
interfaces. The light lines represent interfaces
in progress. The colors of the
components indicate which of the four multi-lab working groups inside
the project is responsible for each one.
In November
2003 the first release of a complete, integrated set of scalable systems
components was made. This distribution
utilized the popular OSCAR packaging
and install technology. A second release
is scheduled for March 2004. This
past year the system administrators at
Argonne National Laboratory decided to
switch their "Chiba City" cluster to use
our scalable systems suite exclusively. In
January 2004 the suite underwent scale
tests on the 5,160-processor Titanium
cluster at the National Center for
Supercomputing Applications. Our research
has developed software to provide communication
service between components over
multiple protocols as well as a flexible authentication
scheme to provide security to
the overall system. Research continues to
harden the working prototypes, improve
integration, and increase scalability to the
target of 10,000-processor systems.
The coordinator
for this project is Al Geist, an ORNL corporate fellow. The participating
organizations include seven Department of Energy
laboratories, three National Science
Foundation supercomputer centers, and
five supercomputer vendors. The DOE
labs are ORNL, ANL, Ames, Lawrence Berkeley,
Los Alamos, Pacific Northwest, and
Sandia national laboratories. The NSF
sites are the NCSA, the Pittsburgh Supercomputing
Center, and the San Diego
Supercomputer Center. The vendors are
IBM, Silicon Graphics, Cray,
Hewlett-Packard, and Intel.
What is the impact of this project?
The Scalable Systems Software project is
a catalyst for fundamentally changing the
way future high-end systems software is
developed and distributed. It will reduce
facility management costs by reducing
the need to support homegrown software,
making higher-quality systems tools available,
and providing the ability to get new
machines up and running faster and keep
them running. The project will also facilitate
more effective use of machines by
scientific applications by providing scalable
job launch, standardized job monitoring
and management software, and allocation
tools for the cost-effective management
and utilization of terascale computer
resources.

System components presently under development and their interfaces. Dark
lines represent working interfaces.