Comparing Three Methods of Extracting COVID-19 Related Symptoms from EHR Data in a Large Healthcare System
The COVID-19 pandemic has claimed over 220,000 lives in the United States. A promising resource for studying the progression of COVID-19 symptoms is the data documented in electronic health record (EHR) systems as part of clinical care. Such data are stored in disparate locations within the EHR, so extracting them requires multiple methods.
Symptoms were extracted from EHR data for all patients in a single large healthcare system who were tested for SARS-CoV-2 through May 31, 2020. Three methods were used: 1) extraction of ICD-10 codes, 2) regular expression matching of clinical notes written with a COVID-19 note template developed for standard use across the health system, and 3) a natural language processing (NLP) pipeline applied to clinical notes. Symptoms were grouped if they were documented in the EHR within the 10 days before a SARS-CoV-2 PCR laboratory test.
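As a rough illustration of the second method and the 10-day rule, the sketch below parses a checklist-style note with a regular expression and checks whether a documentation date falls in the window before a test. The checklist layout (`Symptom: yes/no`, one per line), the function names `parse_template` and `in_window`, and the choice to count the test day itself as inside the window are all assumptions made for this example; the abstract does not describe the actual template or the boundary handling.

```python
import re
from datetime import date, timedelta

# Hypothetical checklist layout: one "Symptom: yes/no" entry per line.
# The real note template used by the health system is not given in the source.
CHECKLIST_LINE = re.compile(
    r"^[ \t]*(?P<symptom>[A-Za-z ]+?)[ \t]*:[ \t]*(?P<answer>yes|no)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_template(note_text):
    """Return the set of symptoms marked 'yes' in a checklist-style note."""
    return {
        m.group("symptom").strip().lower()
        for m in CHECKLIST_LINE.finditer(note_text)
        if m.group("answer").lower() == "yes"
    }


def in_window(documented_on, test_date, days=10):
    """True if the symptom was documented 0 to `days` days before the test.

    Including the test day itself is an assumption of this sketch.
    """
    return timedelta(0) <= test_date - documented_on <= timedelta(days=days)


note = "Fever: Yes\nCough: no\nShortness of breath: yes\n"
symptoms = parse_template(note)
keep = in_window(date(2020, 5, 1), date(2020, 5, 10))
```

A fixed template makes this kind of parsing simple because the pattern only has to recognize one known layout; free-text notes need the NLP pipeline instead.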
A total of 32,924 SARS-CoV-2 PCR tests were conducted on 24,775 unique patients between February 29 and May 31, 2020. The study cohort was refined to 14,159 patients who had both a test and an encounter with a provider during the study period. COVID-19 related symptoms were extracted at different rates across sources within the EHR. NLP detected the most symptoms of any extraction method: 25,433 (91.9%) of all symptoms. Of all symptoms, 17,904 (64.7%) were detected only by NLP and by no other method. The ICD data source added 1,969 (7.1%) symptoms that were not already captured by NLP. Regular expression parsing of notes with a known structure added a further 276 (1.0%) symptoms.
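The reported percentages can be checked with a short calculation. The sketch below assumes the denominator is the sum of the three reported contributions (25,433 + 1,969 + 276 = 27,678), which the abstract does not state directly but which reproduces every percentage given. The helper `incremental_counts` is a hypothetical illustration of how each source's added symptoms could be counted when sources are considered in a fixed order; it is not the authors' code.

```python
# Counts taken from the results; the total is derived here, not reported.
nlp_total = 25_433    # symptoms detected by NLP
nlp_only = 17_904     # symptoms detected by NLP and no other method
icd_added = 1_969     # ICD symptoms not already captured by NLP
regex_added = 276     # template symptoms not captured by NLP or ICD

total = nlp_total + icd_added + regex_added  # assumed denominator


def pct(n):
    """Share of all detected symptoms, as a percentage rounded to one decimal."""
    return round(100 * n / total, 1)


def incremental_counts(sources):
    """Count the records each source adds beyond the sources listed before it.

    `sources` is an ordered list of (name, set_of_records) pairs.
    """
    seen = set()
    added = {}
    for name, records in sources:
        new = records - seen
        added[name] = len(new)
        seen |= new
    return added


shares = {
    "nlp": pct(nlp_total),
    "nlp_only": pct(nlp_only),
    "icd_added": pct(icd_added),
    "regex_added": pct(regex_added),
}
```

Because the counts are incremental, the order in which sources are considered matters: a source evaluated later is credited only with the symptoms that earlier sources missed.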
Discussion & Conclusion
All three extraction methods contributed to COVID-19 symptom detection, with NLP detecting the large majority of symptoms and ICD-coded data detecting the fewest. A standardized note template containing a discrete checklist of COVID-19 related symptoms allowed simple and highly accurate text parsing, but because the template was used infrequently in clinical practice, its ability to increase symptom detection was limited. Given that NLP yielded the highest extraction rate of COVID-19 related symptoms, relying only on methods such as regular expression extraction and structured extraction of ICD codes may miss a substantial amount of symptom data.