Abstract
The Department of Veterans Affairs (VA) archives one of the largest corpora of clinical notes as unstructured text in its Corporate Data Warehouse (CDW). Unstructured text readily supports keyword searches and regular expressions, but these simple searches often cannot express the complex queries that researchers need to run on notes. For example, a researcher may want all notes that report a Duke Treadmill Score less than 5, or all patients who smoke more than one pack per day. Range queries like these, and others, can be supported by modelling text as semi-structured documents. In this paper, we implement a scalable machine learning pipeline that models plain medical text as semi-structured documents. We improve on existing models, achieving an F1-score of 0.912, and scale our methods to the entire VA corpus.
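
To make the motivating range query concrete, the following minimal sketch shows how a note, once represented as a semi-structured document with named numeric fields, can answer "Duke Treadmill Score less than 5" with a simple comparison. Everything here is an illustrative assumption: the example notes, the field name `duke_treadmill_score`, and the helpers `to_semi_structured` and `range_query` are hypothetical, and the regular expression stands in for the machine learning pipeline described in this paper, which it does not reproduce.

```python
import re

# Hypothetical example notes; not taken from the VA corpus.
notes = {
    "note-1": "Exercise stress test completed. Duke Treadmill Score: 3.",
    "note-2": "Duke Treadmill Score of 7, low risk.",
    "note-3": "Patient declined treadmill testing.",
}

# Illustrative pattern only: a stand-in extractor, not the paper's model.
# It captures the first number within a few characters of the score name.
DTS_PATTERN = re.compile(
    r"Duke Treadmill Score\D{0,10}?(-?\d+(?:\.\d+)?)", re.IGNORECASE
)


def to_semi_structured(text):
    """Return a dict of named numeric fields found in the note text."""
    fields = {}
    match = DTS_PATTERN.search(text)
    if match:
        fields["duke_treadmill_score"] = float(match.group(1))
    return fields


def range_query(documents, field, upper):
    """Return sorted IDs of documents whose `field` is strictly below `upper`."""
    return sorted(
        note_id
        for note_id, fields in documents.items()
        if field in fields and fields[field] < upper
    )


# Convert every note once, then query the structured fields numerically.
documents = {note_id: to_semi_structured(text) for note_id, text in notes.items()}
low_scores = range_query(documents, "duke_treadmill_score", 5)
print(low_scores)
```

A keyword search for "Duke Treadmill Score" would return both notes that mention the score, regardless of its value. The numeric comparison becomes possible only after the value has been extracted into a typed field.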