Abstract
This study addresses a central challenge in building use type classification: class imbalance in the training datasets of machine learning classifiers. We comprehensively analyze the efficacy of class-balancing sampling techniques. Using Monte Carlo simulations and Bayesian optimization, we evaluate the performance of multiple sampling methods, including Random Oversampling, Random Undersampling, SMOTE, Borderline-SMOTE, and ADASYN, on a dataset covering nine southeastern coastal states of the United States. Our findings reveal that simple random over- and undersampling outperform the more sophisticated synthetic sampling methods. Additionally, we show that deliberately creating an imbalance in the training data has inherent value for training a machine learning classifier to distinguish between residential and nonresidential buildings. This study provides guidance for future research on building use type classification and lays essential groundwork for developing attribute-rich building stock datasets.