Abstract
This paper presents a multimodal data representation that improves the performance of deep learning (DL) models for extracting key cancer characteristics from unstructured text in pathology reports. Specifically, in addition to the report text, we feed the models concept unique identifiers (CUIs) as a second source of information. We analyze the performance of different text and CUI data representations, including word embeddings and bag of embeddings (BOE), with a convolutional neural network (CNN) and a fully connected multilayer perceptron neural network (MLP-NN). The high-level document embeddings from the text and CUI inputs are concatenated and passed to a classifier. The model is used to extract cancer subsite and histology from pathology reports. These two classification tasks have a large number of labels (317 for subsite and 556 for histology) and extreme class imbalance. We compare the performance of the developed DL models on the two tasks using micro- and macro-F1 scores. The evaluation shows that a multi-channel DL model that uses text represented by word embeddings and CUIs represented by BOE outperforms the other DL models. This approach also significantly improves model performance on low-prevalence classes.
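To illustrate why both micro- and macro-F1 are reported under extreme class imbalance, the following is a minimal sketch of the two averages for single-label classification. It is not the paper's evaluation code: the function name `f1_scores`, the label names, and the toy data are hypothetical, and macro-F1 is averaged here over the labels that appear in the data, which may differ from the averaging convention used in the paper.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (invented for illustration): 8 reports of a common class, 2 of a rare one.
y_true = ["common_site"] * 8 + ["rare_site"] * 2
y_pred = ["common_site"] * 10  # a model that always predicts the common class
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the model never predicts the rare class, yet micro-F1 stays at 0.8 while macro-F1 drops to about 0.444. This is why gains on low-prevalence classes show up mainly in the macro-F1 score.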