Abstract
The study presented here focuses on the performance characteristics and trade-offs of running machine-learning workloads in multi-GPU environments, both on on-site cloud computing resources and on a commercial cloud service (Azure). Specifically, we evaluate these trade-offs by measuring the performance of training and fine-tuning transformer-based deep-learning (DL) networks on clinical notes and data, a task of critical importance in the medical domain. To this end, we perform DL experiments on the widely deployed NVIDIA V100 GPUs and on the newer A100 GPUs connected via NVLink or PCIe. This study analyzes the execution time of the major operations involved in training DL models and investigates popular options for optimizing each of them. We present findings on the impact that various operations (e.g., loading data into GPUs, training, fine-tuning), optimizations, and system configurations (single vs. multi-GPU, NVLink vs. PCIe) have on overall training performance.