Deep learning (DL) is a sub-field of machine learning that focuses on learning features from data through multiple layers of abstraction. These features are learned with little human domain knowledge and have dramatically improved the state of the art in many applications, from computer vision to speech recognition. Because no single network generalizes best across all datasets, a necessary step before applying DL to a new dataset is selecting an appropriate set of hyper-parameters. For convolutional neural networks (CNNs), these hyper-parameters include the number of layers, the number of hidden units per layer, the activation function for a layer, the kernel size for a layer, and the arrangement of these layers within the network. Selecting a network topology for a previously unseen dataset can be a time-consuming and tedious task. The number of hyper-parameters being tuned, combined with the evaluation time for each new set of hyper-parameters, makes their optimization particularly difficult in the context of deep learning.
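To make the scale of the problem concrete, the hyper-parameters listed above can be viewed as a discrete search space from which candidate configurations are drawn. The following is a minimal sketch; the particular hyper-parameter names and value ranges are illustrative assumptions, not the actual search space used in this work.

```python
import random

# Hypothetical discrete search space over CNN hyper-parameters.
# Real spaces also cover layer arrangement, learning rate, etc.
SEARCH_SPACE = {
    "num_layers": [2, 3, 4, 5, 6],
    "hidden_units": [32, 64, 128, 256],
    "activation": ["relu", "tanh", "sigmoid"],
    "kernel_size": [3, 5, 7],
}

def sample_config(space, rng=random):
    """Draw one candidate configuration uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}

config = sample_config(SEARCH_SPACE, random.Random(0))
```

Even this toy space contains 5 * 4 * 3 * 3 = 180 configurations, and each evaluation requires training a network, which illustrates why exhaustive search quickly becomes infeasible.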
Studies of the effects of hyper-parameters on different deep learning architectures have revealed complex relationships: hyper-parameters that yield large performance improvements in simple networks do not have the same effect in more complex architectures. These studies also show that results on one dataset may not transfer to another dataset with different image properties, prior probability distributions, numbers of classes, or numbers of training examples. With no explicit formula for choosing a correct set of hyper-parameters, their selection often depends on a combination of previous experience and trial and error. Because DL algorithms are computationally expensive, and can take days to train on conventional platforms, repeated trial and error is both inefficient and rarely thorough. While the DL community has settled on sets of hyper-parameters that often work well for common image and speech recognition problems, it is not clear that such parameters extend to domains with different data modalities or structures, or that they are indeed optimal. This work addresses the model selection problem and eases the demands on researchers with MENNDL, an evolutionary algorithm that leverages a large number of compute nodes. These nodes communicate over MPI to distribute the search for optimal hyper-parameters across the nodes of a supercomputer.
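The core idea of an evolutionary hyper-parameter search can be sketched serially as follows. This is a simplified illustration, not MENNDL's actual implementation: the toy `evaluate` function stands in for the expensive step of training and validating a network, which in MENNDL is farmed out to worker nodes over MPI, and the selection and mutation scheme shown here is a deliberately minimal assumption.

```python
import random

# Hypothetical discrete search space over CNN hyper-parameters (illustrative).
SEARCH_SPACE = {
    "num_layers": [2, 3, 4, 5, 6],
    "hidden_units": [32, 64, 128, 256],
    "activation": ["relu", "tanh", "sigmoid"],
    "kernel_size": [3, 5, 7],
}

def evaluate(config):
    """Toy fitness standing in for training a CNN and measuring
    validation accuracy. In a distributed setting, each call would be
    dispatched to a separate compute node."""
    return -abs(config["num_layers"] - 4) - abs(config["kernel_size"] - 5) / 10

def mutate(config, space, rng):
    """Copy a parent and re-sample one randomly chosen hyper-parameter."""
    child = dict(config)
    name = rng.choice(list(space))
    child[name] = rng.choice(space[name])
    return child

def evolve(space, generations=10, pop_size=8, seed=0):
    rng = random.Random(seed)
    # Initial population: random configurations.
    pop = [{k: rng.choice(v) for k, v in space.items()} for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate all individuals and keep the fitter half as parents.
        parents = sorted(pop, key=evaluate, reverse=True)[: pop_size // 2]
        # Refill the population with mutated offspring.
        children = [mutate(rng.choice(parents), space, rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=evaluate)

best = evolve(SEARCH_SPACE)
```

Because individuals within a generation are evaluated independently, the expensive fitness calls parallelize naturally across MPI ranks in a master-worker pattern, which is what makes this approach a good fit for a supercomputer.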