BACKGROUND: As artificial intelligence and machine learning techniques have emerged in biomedical informatics, security and privacy concerns over the data and subject identities have become an essential research topic. Machine learning models are agnostic to personal identity; they exploit any patterns and features that improve task performance, including those tied to individual patients.
OBJECTIVE: The privacy vulnerability of deep learning models that extract information from medical text needs to be quantified, because these models are trained directly on content containing protected health information and personally identifiable information. The objective of this study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore ways of protecting patients' information to mitigate privacy leakage.
METHODS: The target model is a multi-task convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple participating state cancer registries. This study proposes the following schemes for selecting vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words with higher mutual information. We performed membership inference attacks on the models in high-performance computing environments.
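The two vocabulary-selection schemes can be illustrated with a minimal sketch. This is not the study's implementation; the function names, whitespace tokenization, and the document-level mutual information estimate between word presence and the task label are all illustrative assumptions.

```python
# Sketch of the two vocabulary-selection schemes described above (illustrative,
# not the study's actual implementation).
from collections import Counter, defaultdict
import math

def registry_overlap_vocab(reports_by_registry, min_registries=2):
    """Scheme (a): keep words that appear in at least `min_registries` registries."""
    presence = defaultdict(set)  # word -> set of registries it appears in
    for registry, reports in reports_by_registry.items():
        for report in reports:
            for word in set(report.split()):
                presence[word].add(registry)
    return {w for w, regs in presence.items() if len(regs) >= min_registries}

def mutual_information_vocab(reports, labels, top_k):
    """Scheme (b): rank words by mutual information between document-level
    word presence and the task label; keep the top_k words."""
    n = len(reports)
    label_counts = Counter(labels)
    word_label = defaultdict(Counter)  # word -> label -> count of docs with word
    word_counts = Counter()            # word -> count of docs containing it
    for report, label in zip(reports, labels):
        for word in set(report.split()):
            word_label[word][label] += 1
            word_counts[word] += 1
    mi = {}
    for word, per_label in word_label.items():
        score = 0.0
        for present in (True, False):
            marg_word = word_counts[word] if present else n - word_counts[word]
            for label, n_label in label_counts.items():
                joint = per_label[label] if present else n_label - per_label[label]
                if joint > 0:
                    # I(W;Y) contribution: p(w,y) * log( p(w,y) / (p(w) p(y)) )
                    score += (joint / n) * math.log((joint * n) / (marg_word * n_label))
        mi[word] = score
    return [w for w, _ in sorted(mi.items(), key=lambda kv: -kv[1])[:top_k]]
```

Under scheme (a), a word unique to one registry (and thus potentially identifying) is dropped; under scheme (b), words carrying little task-relevant signal fall below the cutoff.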
RESULTS and CONCLUSIONS: The results suggest that the proposed vocabulary selection methods reduce privacy vulnerability while maintaining the same level of clinical task performance.