Abstract
In HPC, both hardware and software evolve rapidly. New hardware is often developed and deployed while the corresponding software stack, including compilers and other tools, is still under active development, and leading-edge software developers are simultaneously porting and tuning their applications. While the software ecosystem is in flux, one of the key challenges for users is obtaining insight into the state of implementation of key features in the programming languages and models their applications use: whether a feature has been implemented, and whether the implementation conforms to the specification, especially for newly implemented features that have not yet been exercised by widespread use. OpenMP is one of the most prominent shared-memory programming models used for on-node programming in HPC. With the shift towards accelerators (such as GPUs and FPGAs) and heterogeneous programming, OpenMP features are becoming more complex. It is natural to ask whether generative AI approaches, and large language models (LLMs) in particular, can help produce validation and verification test suites that give users better and faster insight into the availability and correctness of OpenMP features of interest. In this work, we explore the use of ChatGPT-4 to generate a suite of tests for OpenMP features. We chose a set of directives and clauses, 78 combinations in total, that first appeared in OpenMP 3.0 (released in May 2008) but remain relevant for accelerators. We prompted ChatGPT to generate tests in the C and Fortran languages, for both host (CPU) and device (accelerator). On the Summit supercomputer, using the GNU implementation, we found that of the 78 generated tests, 67 C tests and 43 Fortran tests compiled successfully, and fewer still executed to completion. Further analysis shows that not all of the generated tests are valid. We document the process and results, and provide a detailed analysis of the quality of the generated tests. With the aim of providing input to a production-quality validation and verification suite, we manually implement the corrections required to make the tests valid according to the current OpenMP specification. We quantify this effort as small, medium, or large, and record the lines of code changed to correct the invalid tests. With the corrected tests, we validate recent implementations from HPE, AMD, and GNU on the Frontier supercomputer. Our experiment and subsequent analysis show that although LLMs are capable of producing HPC-specific code, they are limited by their understanding of the deeper semantics and restrictions of programming models such as OpenMP. Unsurprisingly, more commonly used features have better support, while some OpenMP 3.0 directives such as sections and tasking are not universally supported on accelerators. We demonstrate that successful compilation and execution to completion are inadequate metrics for evaluating generated code and that, at this time, commodity LLMs require expert intervention for code verification. This points to gaps in the training data currently available for HPC. We show that with "small" effort, 37% of the invalid generated C tests and 63% of the invalid generated Fortran tests could be corrected. This improves the productivity of test generation, since it avoids writing tests from scratch and the common programming errors associated with doing so.
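For illustration only, the following hand-written C sketch (not one of the generated tests) shows the flavor of test described above: it combines a clause introduced in OpenMP 3.0 (collapse) with target offload and verifies the computed results on the host rather than relying on successful compilation or completion alone.

```c
/* Illustrative sketch only: a small test in the spirit of the suite
 * described above. It combines the collapse clause, introduced in
 * OpenMP 3.0, with target offload and verifies the result on the host. */
#include <stdio.h>

#define N 64

int main(void) {
    int a[N][N];
    int errors = 0;

    /* Offload a doubly nested loop, collapsing both levels into one
     * iteration space distributed across the device threads. */
    #pragma omp target teams distribute parallel for collapse(2) map(from: a)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i][j] = i * N + j;
        }
    }

    /* Host-side verification of every element. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i][j] != i * N + j)
                errors++;

    printf("%s\n", errors == 0 ? "PASS" : "FAIL");
    return errors != 0;
}
```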