New Initiatives in Natural Language Processing (NLP)

Hongfang Liu, PhD, leader of the National COVID Cohort Collaborative (N3C) Natural Language Processing (NLP) Subgroup, has created a process for extracting important variables from the free-text, or unstructured, portion of the electronic health record (EHR). Liu’s process will allow researchers to test their algorithms on data provided by the N3C to see how well they can de-identify unstructured data. Details about the infrastructure of the project can be found on the N3C NLP GitHub wiki page.

The NLP process is particularly important for COVID-19 research in the N3C Data Enclave because social determinants of health play a major role in assessing health outcomes, and this information is not captured well in the structured data fields of the EHR. Social, geographical, and environmental health implications mostly reside in the unstructured fields and can be misrepresented in data analysis if not taken into account. Applying NLP algorithms to access this information can help provide a fuller picture of health outcomes. Additionally, Long COVID symptoms do not yet have established codes and therefore need to be extracted from the unstructured fields of the health record to be incorporated into research results. The NLP process can help find associated keywords and make them discoverable so that their relationship to COVID-19 can be examined, thereby contributing to more robust research results.

Dr. Liu and Mayo Clinic have also been key contributors to the NLPSandbox, an open benchmarking platform that was recently launched. Led by Thomas Schaffter at Sage Bionetworks, the NLPSandbox streamlines the development and benchmarking of tools that are robust, reusable, and cloud-friendly for public and private datasets. The project has onboarded Medical College of Wisconsin as its first data partner. Additional data from Mayo Clinic and University of Washington will soon be incorporated, enabling multi-site evaluation and assessment of whether tool performance generalizes across multiple datasets. The service is now open for submissions. Learn more about how to get started from the NLPSandbox blog post.

A manuscript describing the pilot study that introduced the NLPSandbox and the performance of state-of-the-art algorithms is currently in the works and will be submitted to a peer-reviewed journal in the coming months. Stay tuned for more innovations from the N3C.

View the NLPSandbox.

This article was featured in June’s Ansible.

  • NLP
  • translational research

Publishing CTSA Program Hub’s Name
Center for Data to Health

CTSA Program In Action Goals
Goal 1: Train and Cultivate the Translational Science Workforce
Goal 2: Engage Patients and Communities in Every Phase of the Translational Process
Goal 5: Advance the Use of Cutting-Edge Informatics