Extending OpenSHMEM with Aggregation Support for Improved Message Rate Performance

by Aaron Welch, Oscar R Hernandez Mendoza, Stephen Poole
Publication Type
Conference Paper
Book Title
Euro-Par 2023: Parallel Processing
Publication Date
2023
Page Numbers
32 to 46
Volume
14100
Publisher Location
Cham, Switzerland
Conference Name
29th International European Conference on Parallel and Distributed Computing (Euro-Par 2023)
Conference Location
Limassol, Cyprus
Conference Sponsor
University of Cyprus
Conference Date
-

Abstract

OpenSHMEM is a highly efficient one-sided communication API that implements the PGAS parallel programming model and is known for low-latency communication operations that map efficiently onto the RDMA capabilities of network interconnects. However, applications that use OpenSHMEM can be sensitive to point-to-point message rates, as many-to-many communication patterns can generate large numbers of small messages that tend to overwhelm network hardware, which has predominantly been optimised for bandwidth over message rate. Additionally, many important emerging classes of problems, such as data analytics, are similarly troublesome because of the irregular access patterns they employ. Message aggregation strategies have been proven to significantly enhance network performance, but their implementation often involves complex restructuring of user code, making them unwieldy. This paper shows how to combine the best qualities of message aggregation with the communication model of OpenSHMEM so that applications with small and irregular access patterns can improve network performance while retaining their algorithmic simplicity. We do this by providing a path to a message aggregation framework called conveyors through a minimally intrusive OpenSHMEM extension that introduces aggregation contexts, which fit naturally with the OpenSHMEM model of atomics, gets, and puts. We test these extensions using four of the bale 3.0 applications, which contain essential many-to-many access patterns, and show that they can produce performance improvements of up to 65×.
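A rough sketch of the aggregation-context idea, for illustration only: the SHMEMX_CTX_AGGREGATE option and the shmemx.h header below are assumed, hypothetical names and are not taken from the paper or the OpenSHMEM specification, while shmem_ctx_create, shmem_ctx_long_atomic_add, shmem_ctx_quiet, and shmem_ctx_destroy are standard OpenSHMEM 1.4 context routines. The access pattern loosely mirrors a histogram-style many-to-many update of the kind found in the bale applications.

/* Hypothetical sketch: issue fine-grained remote updates through a context
 * that (under the assumed extension) may aggregate them before delivery. */
#include <shmem.h>
#include <shmemx.h>   /* assumed location of the hypothetical option flag */
#include <stdlib.h>

#define TABLE_SIZE  1024
#define NUM_UPDATES (1 << 20)

static long table[TABLE_SIZE];   /* symmetric table, one slice per PE */

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    shmem_ctx_t ctx;
    /* Hypothetical: request an aggregating context; if the extension is
     * unavailable, fall back to an ordinary (non-aggregating) context. */
    if (shmem_ctx_create(SHMEMX_CTX_AGGREGATE, &ctx) != 0) {
        shmem_ctx_create(0, &ctx);
    }

    srand((unsigned)me);
    for (long i = 0; i < NUM_UPDATES; i++) {
        long idx    = rand() % ((long)TABLE_SIZE * npes);
        int  pe     = (int)(idx / TABLE_SIZE);
        long offset = idx % TABLE_SIZE;
        /* The call site is the same as in a non-aggregated version; the
         * context decides how the small updates reach the network. */
        shmem_ctx_long_atomic_add(ctx, &table[offset], 1, pe);
    }

    shmem_ctx_quiet(ctx);      /* flush any buffered operations */
    shmem_ctx_destroy(ctx);

    shmem_barrier_all();
    shmem_finalize();
    return 0;
}

The point of the sketch is the one the abstract makes: the application keeps its simple one-sided put/get/atomic structure, and any aggregation is confined to the context rather than requiring a restructuring of the update loop.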