
Characterizing Sub-Cohorts via Data Normalization and Representation Learning...

Publication Type
Conference Paper
Book Title
IEEE 33rd International Symposium on Computer Based Medical Systems
Conference Name
IEEE International Symposium on Computer Based Medical Systems (CBMS)
Conference Location
Rochester, Minnesota, United States of America

Identifying a cohort of interest is a challenging task. It requires manually inspecting many patient records with complex structure, which may contain medical coding errors and missing data. This paper presents a computational pipeline for refining cohort selection based on the medical concepts recorded in electronic health records (EHRs). The pipeline extracts EHR data for a given cohort and normalizes the data using standard vocabularies. A stacked denoising autoencoder then embeds the normalized patient vectors in a low-dimensional space, where the patients are clustered into sub-cohorts. The goal is to represent the cohort in a standard format and to abstract variants of sub-populations. As a use case, we applied the pipeline to 1.8 million Veterans diagnosed with major depressive disorder (MDD) and identified four meaningful sub-cohorts using the features learned by the autoencoder. Each sub-cohort was then explored using a set of keywords for interpretation.
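
The three stages named in the abstract (normalization, embedding with a stacked denoising autoencoder, clustering) can be illustrated with a minimal sketch in standard-library Python. This is not the authors' implementation. Everything specific in it is an assumption made for illustration: the site codes and concept names are invented, and the layer sizes, masking-noise rate, tied weights, greedy layer-wise training, and the use of k-means with two clusters are choices the abstract does not specify.

```python
import math
import random

# Hypothetical site-specific codes mapped to made-up standard concepts.
CODE_TO_CONCEPT = {
    "siteA:dep1": "depression", "siteB:D-01": "depression",
    "siteA:anx1": "anxiety",    "siteB:A-07": "anxiety",
    "siteA:pain3": "pain",      "siteB:P-12": "pain",
    "siteA:ins2": "insomnia",   "siteB:I-04": "insomnia",
}
CONCEPTS = sorted(set(CODE_TO_CONCEPT.values()))

def normalize(codes):
    """Map raw codes to a multi-hot vector over concepts; unmapped codes are dropped."""
    present = {CODE_TO_CONCEPT[c] for c in codes if c in CODE_TO_CONCEPT}
    return [1.0 if concept in present else 0.0 for concept in CONCEPTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    """Hidden code of one input vector under encoder weights W and biases b."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]

def train_dae(data, n_hidden, rng, epochs=200, lr=0.1, drop=0.25):
    """Train one denoising layer (tied weights, masking noise, cross-entropy loss)."""
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b, c = [0.0] * n_hidden, [0.0] * n_in  # encoder and decoder biases
    for _ in range(epochs):
        for x in data:
            # Corrupt the input, then learn to reconstruct the clean version.
            noisy = [0.0 if rng.random() < drop else xi for xi in x]
            h = encode(noisy, W, b)
            r = [sigmoid(sum(W[j][i] * h[j] for j in range(n_hidden)) + c[i])
                 for i in range(n_in)]
            d_out = [ri - xi for ri, xi in zip(r, x)]  # gradient at output pre-activation
            d_hid = [sum(W[j][i] * d_out[i] for i in range(n_in)) * h[j] * (1.0 - h[j])
                     for j in range(n_hidden)]
            for j in range(n_hidden):
                for i in range(n_in):
                    # Tied weights: decoder term plus encoder term.
                    W[j][i] -= lr * (d_out[i] * h[j] + d_hid[j] * noisy[i])
                b[j] -= lr * d_hid[j]
            for i in range(n_in):
                c[i] -= lr * d_out[i]
    return W, b

def kmeans(points, k, rng, iters=20):
    """Plain k-means; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

rng = random.Random(0)
patients = [  # toy records; each is a list of raw codes
    ["siteA:dep1", "siteA:anx1"],
    ["siteB:D-01", "siteB:A-07"],
    ["siteA:dep1", "siteA:pain3"],
    ["siteB:D-01", "siteB:P-12", "unmapped:zzz"],
    ["siteA:dep1", "siteA:anx1", "siteA:ins2"],
    ["siteB:D-01", "siteB:P-12", "siteB:I-04"],
    ["siteA:dep1"],
    ["siteB:D-01", "siteB:A-07", "siteB:I-04"],
]
vectors = [normalize(p) for p in patients]

# Stack two layers: the second is trained on the codes produced by the first.
W1, b1 = train_dae(vectors, 3, rng)
hidden = [encode(v, W1, b1) for v in vectors]
W2, b2 = train_dae(hidden, 2, rng)
embeddings = [encode(h, W2, b2) for h in hidden]

labels = kmeans(embeddings, 2, rng)
print(labels)
```

Because normalization maps equivalent codes from different sites to the same concept, patients 0 and 1 (and likewise 2 and 3) receive identical vectors and therefore land in the same cluster, which is the practical benefit of normalizing before embedding.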