Christian Engelmann

Research and Development Staff

Bio

Dr. Christian Engelmann is an R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, the US Department of Energy's (DOE) largest multiprogram science and technology laboratory, with an annual budget of $1.4 billion. He has 17 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems and a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann's research addresses computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.

His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and was a member of the DOE Technical Council on HPC Resilience. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC.

His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processors, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.
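To illustrate the lightweight-simulation idea, the following is a toy sketch in Python, not the actual xSim implementation; the function names and parameter values are invented. A simulator of this kind replaces real execution with a discrete-event model, so the effect of a hardware property such as network latency on time-to-solution can be explored for large virtual machine sizes on a modest host:

```python
import heapq

# Toy discrete-event simulator (illustrative only; not the xSim code).
# Each virtual process repeatedly computes and then passes a message to
# its ring neighbor; the simulated clock advances per event, so the run
# cost depends on the number of events, not on real hardware.

def simulate_ring(num_procs, num_steps, compute_s, latency_s):
    """Return the simulated finish time of a ring exchange."""
    # Event = (simulated time, process rank, step); start everyone at t=0.
    events = [(0.0, p, 0) for p in range(num_procs)]
    heapq.heapify(events)
    finish = 0.0
    while events:
        t, rank, step = heapq.heappop(events)
        t += compute_s + latency_s          # compute, then send to neighbor
        if step + 1 < num_steps:
            heapq.heappush(events, (t, (rank + 1) % num_procs, step + 1))
        finish = max(finish, t)
    return finish

# Vary an assumed network latency and observe its impact on time-to-solution.
for latency in (1e-6, 5e-6, 25e-6):
    print(latency, simulate_ring(num_procs=1000, num_steps=10,
                                 compute_s=5e-6, latency_s=latency))
```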

Dr. Engelmann earned an M.Sc. in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as part of a double-diploma program, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and a Member of the Institute of Electrical and Electronics Engineers (IEEE), the Society for Industrial and Applied Mathematics (SIAM), and the Advanced Computing Systems Association (USENIX).

Awards

2015 US Department of Energy Early Career Research Award

Projects

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Extreme-scale high-performance computing (HPC) will significantly advance discovery in fundamental scientific processes by enabling multiscale simulations that range from the very small, on quantum and atomic scales, to the very large, on planetary and cosmological scales. Computing at scales in the hundreds of petaflops, exaflops, and beyond will also lend a competitive advantage to the US energy and industrial sectors by providing the computing power for rapid design and prototyping and for big data analysis. Yet to build and effectively operate extreme-scale HPC systems, the US Department of Energy cites several key challenges, including resilience: efficient and correct operation despite the occurrence of faults or defects in system components that can cause errors. These innovative systems require equally innovative components designed to communicate and compute at unprecedented rates, scales, and levels of complexity, increasing the probability of hardware and software faults.

This research project offers a structured hardware and software design approach for improving resilience in extreme-scale HPC systems, so that scientific applications running on these systems generate accurate solutions in a timely and efficient manner. Frequently used in computer engineering, design patterns identify problems and provide generalized solutions through reusable templates. Using a novel resilience design pattern concept, this project identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout hardware and software components in HPC systems. This effort will create comprehensive methods and metrics by which system vendors and computing centers can establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components and optimize the cost-benefit trade-offs among performance, resilience, and power consumption. Reusable programming templates of these patterns will offer resilience portability across different HPC system architectures and permit design space exploration and adaptation to different design trade-offs. A minimal sketch of the reusable-template idea follows below. For more information, please visit http://ornlwiki.atlassian.net/wiki/display/RDP.
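The sketch below, in Python, assumes a simple step-based application; the class and method names are hypothetical and do not reflect the project's actual pattern catalog or interfaces. It instantiates one well-known recovery pattern, checkpoint/rollback, behind a generic interface, so the same fault-handling logic can wrap different computations:

```python
import os
import pickle

# Hypothetical sketch of a reusable resilience pattern template (names are
# invented; the project's actual catalog and interfaces may differ). The
# checkpoint/rollback recovery pattern is wrapped behind a generic run()
# interface, so the same fault-handling logic is portable across codes.

class CheckpointRollback:
    def __init__(self, path, interval):
        self.path = path          # checkpoint file location
        self.interval = interval  # checkpoint every N steps

    def _save(self, step, state):
        with open(self.path, "wb") as f:
            pickle.dump((step, state), f)

    def _load(self):
        with open(self.path, "rb") as f:
            return pickle.load(f)

    def run(self, state, step_fn, num_steps):
        if os.path.exists(self.path):
            step, state = self._load()    # resume after a prior failure
        else:
            step = 0
            self._save(step, state)       # initial checkpoint
        while step < num_steps:
            try:
                state = step_fn(state)    # one unit of application work
                step += 1
                if step % self.interval == 0:
                    self._save(step, state)
            except Exception:             # transient error assumed: roll back
                step, state = self._load()
        return state

# Example: protect a trivial iterative computation with the pattern.
pattern = CheckpointRollback("ckpt.bin", interval=10)
print(pattern.run(0, lambda s: s + 1, num_steps=100))  # prints 100
```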

Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

Department of Energy (DOE) leadership computing facilities are in the process of deploying extreme-scale high-performance computing (HPC) systems, with the long-range goal of building exascale systems that perform more than a quintillion (a billion billion) operations per second. More powerful computers mean researchers can simulate biological, chemical, and other physical interactions with an unprecedented amount of realism. Yet as HPC systems become more complex, component vendors and computing facilities are preparing for unique computing challenges, particularly the occurrence of unfamiliar or more frequent faults in hardware technologies and software applications that can lead to computational errors or system failures. This project will help DOE computing facilities protect extreme-scale systems by characterizing potential faults and creating models that predict their propagation and impact.

The forthcoming Collaboration of Oak Ridge, Argonne and Lawrence Livermore National Laboratories (CORAL) is a public-private partnership that will stand up three extreme-scale systems in 2017, each operating at about 150 to 200 petaflops, or nearly 10 times the performance of the 27-petaflop Titan at Oak Ridge National Laboratory (currently the fastest system in the United States) and about a tenth of exascale performance. By monitoring hardware and software performance on current DOE systems such as Titan and applying the data to fault analysis and vulnerability studies, this effort will capture observed and inferred fault conditions and extrapolate this knowledge to CORAL and other extreme-scale systems. Using these analyses, the project team will create assessment tools, including a fault taxonomy, a fault catalog, and fault models, to provide computing facilities with a clear picture of the fault characteristics in DOE computing environments and to inform technical and operational decisions that improve resilience. A sketch of what a catalog entry might look like follows below. The catalog, models, and open-source software resulting from this project will be made publicly available. For more information, please visit http://ornlwiki.atlassian.net/wiki/display/CFEFIES.
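The following is a purely illustrative Python sketch of a machine-readable catalog entry; the field names and taxonomy values are assumptions, not the project's actual schema. Each observed event is classified along a small taxonomy, from which aggregate reliability statistics such as mean time between failures follow directly:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical fault catalog entry (invented schema; the project's actual
# taxonomy and catalog format may differ). The fault/error/failure split
# follows the standard dependability terminology used in the bio above.

class Kind(Enum):
    FAULT = "fault"        # underlying defect or flaw
    ERROR = "error"        # incorrect state caused by an activated fault
    FAILURE = "failure"    # externally observable incorrect behavior

@dataclass
class Event:
    kind: Kind
    component: str      # e.g., "DRAM", "GPU", "parallel file system"
    persistence: str    # "transient", "intermittent", or "permanent"
    timestamp: float    # seconds since start of the observation window

def mtbf(events):
    """Mean time between failures over the observation window."""
    failures = sorted(e.timestamp for e in events if e.kind is Kind.FAILURE)
    gaps = [b - a for a, b in zip(failures, failures[1:])]
    return sum(gaps) / len(gaps) if gaps else float("inf")

log = [
    Event(Kind.FAILURE, "GPU", "transient", 3600.0),
    Event(Kind.ERROR, "DRAM", "transient", 5000.0),
    Event(Kind.FAILURE, "parallel file system", "intermittent", 90000.0),
]
print(mtbf(log))  # 86400.0 seconds between the two recorded failures
```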

Publications

Selected Peer-Reviewed Journal Publications

[1] Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Journal of Supercomputing Frontiers and Innovations (JSFI), volume 4, number 3, pages 4-42, 2017. South Ural State University, Chelyabinsk, Russia. ISSN 2409-6008.

[2] Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Epidemic Failure Detection and Consensus. International Journal of High Performance Computing Applications (IJHPCA), 2017. SAGE Publications. ISSN 1094-3420.

[3] Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, 2016. John Wiley & Sons, Inc. ISSN 1532-0634.

[4] Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, 2014. SAGE Publications. ISSN 1094-3420.

[5] Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, pages 59-65, 2014. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0167-739X.

[6] Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, 2012. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.

[7] Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, 2009. Elsevier B.V., Amsterdam, The Netherlands. ISSN 0743-7315.

Selected Peer-Reviewed Conference Publications

[1] Saurabh Gupta, Devesh Tiwari, Tirthak Patel, and Christian Engelmann. Reliability of HPC systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, Denver, CO, USA, November 12-17, 2017. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 18.7% (61/327). To appear.

[2] Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 311-322, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. Acceptance rate 22.4% (58/259).

[3] David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. Acceptance rate 24.2% (43/178).

[4] Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Large Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. Acceptance rate 23.0% (114/496).

[5] David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. Acceptance rate 21.2% (100/472).

[6] James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. Acceptance rate 13% (71/515).

[7] Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. Acceptance rate 21% (118/569).

[8] Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. Acceptance rate 28.1% (48/171).

[9] Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9.

[10] Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. Acceptance rate 19.8% (50/253).