JSONize: A Scalable Machine Learning Pipeline to Model Medical Notes as Semi-structured Documents

by Everett N Rush III, Ioana Danciu, George Ostrouchov, Benjamin W Mayer, Edmon Begoli

Publication Type

Conference Paper

Book Title

AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

Publication Date

May 2020

Page Numbers

533 to 541

Publisher Location

United States of America

Conference Name

AMIA Informatics Summit

Conference Location

Houston, Texas, United States of America

Conference Sponsor

AMIA

Conference Date

Mar 23, 2020 - Mar 26, 2020

Abstract

The Department of Veterans Affairs (VA) archives the largest corpora of clinical notes in its corporate data warehouse (CDW) as unstructured text data. Unstructured text readily supports keyword searches and regular expressions, but these simple searches often do not adequately support the complex searches that need to be performed on notes. For example, a researcher may want all notes with a Duke Treadmill Score less than 5, or all people who smoke more than 1 pack per day. Range queries like these, and more, can be supported by modelling text as semi-structured documents. In this paper, we implement a scalable machine learning pipeline that models plain medical text as useful semi-structured documents. We improve on existing models, achieve an F1-score of 0.912, and scale our methods to the entire VA corpus.
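To make the range-query idea from the abstract concrete, the following is a minimal sketch of what querying notes modelled as semi-structured documents might look like. The field names (`note_id`, `duke_treadmill_score`, `packs_per_day`), the sample values, and the `range_query` helper are all hypothetical and chosen for illustration; they are not the schema or the code produced by the paper's pipeline.

```python
import json

# Hypothetical semi-structured notes. The field names and values are
# illustrative assumptions, not the schema produced by the paper's pipeline.
notes = [
    {"note_id": 1, "duke_treadmill_score": 3, "packs_per_day": 0.0},
    {"note_id": 2, "duke_treadmill_score": 8, "packs_per_day": 1.5},
    {"note_id": 3, "packs_per_day": 2.0},  # score not recorded in this note
]


def range_query(docs, field, low=None, high=None):
    """Return docs whose numeric `field` is strictly above `low` and strictly below `high`."""
    hits = []
    for doc in docs:
        value = doc.get(field)
        if not isinstance(value, (int, float)):
            continue  # field is missing or not numeric, so it cannot match
        if low is not None and value <= low:
            continue
        if high is not None and value >= high:
            continue
        hits.append(doc)
    return hits


# "All notes with a Duke Treadmill Score less than 5"
low_dts = range_query(notes, "duke_treadmill_score", high=5)
# "People who smoke more than 1 pack per day"
heavy_smokers = range_query(notes, "packs_per_day", low=1)

print(json.dumps([d["note_id"] for d in low_dts]))
print(json.dumps([d["note_id"] for d in heavy_smokers]))
```

A keyword or regular-expression search over the raw text can find the phrase "Duke Treadmill Score" but cannot compare the number that follows it; once the value is stored as a typed field, the comparison becomes a simple filter.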
