
Multimodal Data Representation with Deep Learning for Extracting Cancer Characteristics from Clinical Text

Publication Type
Conference Paper
Book Title
2020 IEEE International Conference on Big Data (Big Data)
Page Numbers
4226 to 4232
Publisher Location
United States of America
Conference Name
IEEE International Conference on Big Data
Conference Location
Virtual, Georgia, United States of America
Conference Sponsor
IEEE

This paper presents a multimodal data representation that improves the performance of deep learning (DL) models for extracting key cancer characteristics from unstructured text in pathology reports. Specifically, in addition to using the report text as input to the DL models, we use concept unique identifiers (CUIs) as a second source of information for the models. We analyze the performance of different text and CUI data representations, including word embeddings and bag of embeddings (BOE), with a convolutional neural network (CNN) and a fully connected multilayer perceptron neural network (MLP-NN). The high-level document embeddings from the text and CUI inputs are concatenated and passed to a classifier. The model is used to extract cancer subsite and histology from pathology reports. These two classification tasks have a large number of labels, i.e., 317 for subsite and 556 for histology, with extreme class imbalance. We compare the performance of the developed DL models across the two tasks using micro- and macro-F1 scores. The evaluation shows that a multi-channel DL model that represents text with word embeddings and CUIs with BOE outperforms the other DL models. This approach also significantly improves model performance on low-prevalence classes.
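The two-channel design described above can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: mean pooling stands in for the CNN (text channel) and the MLP-NN (CUI channel), the embedding tables are random, the CUI strings are placeholders, and the classifier is untrained. It only shows the data flow: each channel produces a document vector, the two vectors are concatenated, and a classifier scores the result.

```python
import random

random.seed(0)
DIM = 8          # embedding dimension (illustrative)
NUM_CLASSES = 3  # the real tasks have 317 (subsite) / 556 (histology) labels

def embed_table(vocab):
    """Random embedding lookup table (stand-in for learned embeddings)."""
    return {tok: [random.uniform(-1, 1) for _ in range(DIM)] for tok in vocab}

def mean_pool(vectors):
    """Average a list of DIM-dimensional vectors into one document vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]

# Channel 1 vocabulary: report tokens; Channel 2: placeholder CUI identifiers.
word_emb = embed_table(["infiltrating", "ductal", "carcinoma", "breast"])
cui_emb = embed_table(["C0000001", "C0000002"])  # placeholder CUIs

def document_vector(tokens, cuis):
    text_vec = mean_pool([word_emb[t] for t in tokens])  # text channel (CNN in the paper)
    cui_vec = mean_pool([cui_emb[c] for c in cuis])      # BOE channel (MLP-NN in the paper)
    return text_vec + cui_vec                            # concatenate the two embeddings

# Untrained linear classifier over the concatenated 2*DIM document vector.
weights = [[random.uniform(-1, 1) for _ in range(2 * DIM)]
           for _ in range(NUM_CLASSES)]

def classify(tokens, cuis):
    doc = document_vector(tokens, cuis)
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, doc)) for w in weights]
    return max(range(NUM_CLASSES), key=lambda k: scores[k])

label = classify(["infiltrating", "ductal", "carcinoma", "breast"],
                 ["C0000001", "C0000002"])
print(label)
```

In the paper, the pooling and classifier weights are learned end to end; the fixed, random values here exist only so the sketch runs as-is.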