JSONize: A Scalable Machine Learning Pipeline to Model Medical Notes as Semi-structured Documents...

by Everett N Rush III, Ioana Danciu, George Ostrouchov, Benjamin W Mayer, Edmon Begoli
Publication Type
Conference Paper
Book Title
AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science
Page Numbers
533 to 541
Publisher Location
United States of America
Conference Name
AMIA Informatics Summit
Conference Location
Houston, Texas, United States of America

The Department of Veterans Affairs (VA) archives the largest corpora of clinical notes in its corporate data warehouse (CDW) as unstructured text data. Unstructured text readily supports keyword searches and regular expressions, but these simple searches often do not adequately support the complex queries that need to be performed on notes. For example, a researcher may want all notes with a Duke Treadmill Score less than 5, or all patients who smoke more than 1 pack per day. Range queries like these, and more, can be supported by modeling text as semi-structured documents. In this paper, we implement a scalable machine learning pipeline that models plain medical text as useful semi-structured documents. We improve on existing models, achieve an F1-score of 0.912, and scale our methods to the entire VA corpus.
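To make the idea of range queries over semi-structured documents concrete, here is a minimal Python sketch. It is not the pipeline described in the paper: hand-written regular expressions stand in for the learned model, and the sample notes, field names (`duke_treadmill_score`, `packs_per_day`), and helper functions (`to_document`, `range_query`) are all hypothetical, chosen only for illustration.

```python
import json
import re

# Hypothetical toy notes; not drawn from any real corpus.
NOTES = [
    "Pt completed stress test. Duke Treadmill Score: 3. Smokes 2 packs per day.",
    "Duke Treadmill Score: 7. Denies tobacco use.",
    "Follow-up visit. Smokes 0.5 packs per day.",
]

# Assumption: hand-written patterns stand in for the learned model in the paper.
PATTERNS = {
    "duke_treadmill_score": re.compile(r"Duke Treadmill Score:\s*(-?\d+(?:\.\d+)?)"),
    "packs_per_day": re.compile(r"Smokes\s+(\d+(?:\.\d+)?)\s+packs? per day"),
}


def to_document(note_id, text):
    """Turn one plain-text note into a JSON-like dict with typed numeric fields."""
    doc = {"note_id": note_id, "text": text}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            doc[field] = float(match.group(1))
    return doc


def range_query(docs, field, low=float("-inf"), high=float("inf")):
    """Return documents whose numeric `field` satisfies low < value < high."""
    return [d for d in docs if field in d and low < d[field] < high]


documents = [to_document(i, text) for i, text in enumerate(NOTES)]

# The two example queries from the abstract.
low_dts = range_query(documents, "duke_treadmill_score", high=5)
heavy_smokers = range_query(documents, "packs_per_day", low=1)

print(json.dumps(low_dts, indent=2))
```

Once the values are stored as typed fields, a comparison such as "score less than 5" becomes a simple numeric filter, which a keyword or regular-expression search over raw text cannot express directly.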