Insights to Inspire 2022: The Journey Continues – Data Quality

Move Toward a More Synergistic Model Helps Improve Data Quality


In re-imagining its Clinical Data Warehouse (CDW), Spectrum, the Stanford Center for Clinical and Translational Research and Education — home to the Stanford Clinical & Translational Science Award (CTSA) — has focused on improving the quality of its research data and increasing the frequency with which it refreshes that data. As a result, researchers can now participate in many more data studies than before, including Observational Health Data Sciences and Informatics (OHDSI) network studies and study-a-thons such as COVID-related descriptive studies.

The transformation involved moving from a federated model to a more synergistic model and transitioning to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, allowing for the systematic analysis of disparate databases.

“With the growth of our clinical research, we have had to move from a more federated approach to become much more synergistic. We still have a great deal to learn in producing these synergistic interactions, but thanks to our CTSA team we are making tremendous progress in that direction,” said Dr. Ruth O’Hara, PI and Director of the Spectrum CTSA and Senior Associate Dean, Research, in the School of Medicine at Stanford University.

Stanford launched its OMOP CDW in 2019 and expanded it in 2020. STARR-OMOP, a fully de-identified CDW containing all of Stanford’s EHR data in the OMOP data model, is now directly accessible to all University researchers for pre-Institutional Review Board studies. In addition to running its own instance of ATLAS, Stanford enabled the ATLAS execution engine, becoming one of the few academic institutions to do so. This greatly eases the process for researchers by allowing them to run OHDSI patient-level prediction, characterization, incidence-rate and population-level estimation studies directly on the portal. The execution engine also gives researchers a shortcut by letting them run network studies end to end. Because network studies require a great deal of code, this shortcut enables the University to participate in network studies that it could not easily join before.

Stanford also dramatically improved its workflow, shortening the CDW refresh cycle from quarterly to weekly. “Rewriting our pipelines to refresh weekly was an extensive but worthwhile undertaking,” said Priya Desai, Research Manager, Technology & Digital Solutions, Stanford University. “Allowing our researchers to have access to this data more frequently yielded a huge benefit, especially for COVID studies, because numbers were changing so rapidly.”

Including data in flow sheets was also very useful for researchers, but the enormous amount of data – more than 1,850 templates, 4,685 groups and 25,777 distinct types of row measures – made manual mapping difficult and time-consuming. To overcome this, unmapped flow-sheet data were brought into the observation table, adding 3.6 billion table rows. The CTSA informatics team also worked to make sure its person, provider, care site, visit detail and visit occurrence IDs were stable in its de-identified cohort between runs.
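The article does not say how the team kept its de-identified IDs stable between runs. One common way to get that property is to derive each surrogate ID from the source ID with a keyed hash, so the same input always yields the same output without storing a lookup table. The sketch below is only an illustration under that assumption; the function name `stable_surrogate_id`, the `domain` labels and the key handling are hypothetical and not taken from Stanford's pipeline.

```python
import hashlib
import hmac


def stable_surrogate_id(source_id: str, secret_key: bytes, domain: str) -> int:
    """Derive a deterministic de-identified ID from a source ID.

    Hypothetical helper: the same (domain, source_id, key) always gives
    the same integer, so IDs do not change from one refresh to the next.
    """
    # Prefix with the domain so a person and a provider that share a
    # source ID do not end up with the same surrogate ID.
    message = f"{domain}:{source_id}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    # Keep 63 bits so the value fits a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") >> 1


if __name__ == "__main__":
    key = b"example-key-kept-outside-the-warehouse"  # assumption: managed secret
    print(stable_surrogate_id("MRN-0001", key, "person"))
    print(stable_surrogate_id("MRN-0001", key, "person"))  # same value again
```

Truncating the hash to 63 bits makes collisions unlikely but not impossible, so a real pipeline would still need a uniqueness check on the generated IDs.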

A golden data set

As it set out to improve processes to deliver refreshed data on a weekly basis, the team decided it needed to check its data against a “golden data set.” It created a regression test to benchmark against a much smaller data set of 150,000 patients. Any improvements or modifications made to the code were checked against the regression test before being moved into production.
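A regression check of this kind can be sketched as follows. This is a minimal illustration, not the team's actual test: it assumes each table is reduced to a few summary numbers (row count and distinct `person_id` count) and that a candidate build passes only if those numbers match the stored golden baseline. The function names and the choice of metrics are assumptions.

```python
from typing import Dict, List

Summary = Dict[str, Dict[str, int]]


def summarize(tables: Dict[str, List[dict]]) -> Summary:
    """Reduce each table to its row count and distinct person count."""
    summary: Summary = {}
    for name, rows in tables.items():
        summary[name] = {
            "rows": len(rows),
            "persons": len({row["person_id"] for row in rows}),
        }
    return summary


def regression_diffs(golden: Summary, candidate: Summary) -> List[str]:
    """List every difference between the golden and candidate summaries.

    An empty list means the candidate build reproduces the baseline.
    """
    diffs: List[str] = []
    for table in sorted(set(golden) | set(candidate)):
        if table not in candidate:
            diffs.append(f"{table}: missing from candidate")
        elif table not in golden:
            diffs.append(f"{table}: not in golden baseline")
        else:
            for metric, expected in golden[table].items():
                actual = candidate[table].get(metric)
                if actual != expected:
                    diffs.append(
                        f"{table}.{metric}: expected {expected}, got {actual}"
                    )
    return diffs


if __name__ == "__main__":
    golden = summarize({"person": [{"person_id": 1}, {"person_id": 2}]})
    candidate = summarize({"person": [{"person_id": 1}]})
    for line in regression_diffs(golden, candidate):
        print(line)
```

Comparing summaries rather than full tables keeps the check fast enough to run on every code change; an intended change to the output would then require deliberately regenerating the baseline.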

Improvements to the development pipeline also were needed to churn out data weekly. Previously, this process was not as well-defined, and, as a result, branches and data sets were being created in silos. Under the new approach, the process is more streamlined, regular code reviews are conducted like clockwork and nothing goes into the master build until the entire team has reviewed it.

“Checking data against the golden data set in the new approach helps minimize errors,” explained Desai. “When the data is released to our investigators, they can be sure that no mistakes will be discovered months down the road.”

While improving the development process was always a goal, the COVID-19 pandemic gave the team the catalyst to expedite the effort. Patient visit information was re-mapped to cover a broader range of visit types. Previously, the data vocabulary included inpatient, outpatient and emergency room visits. During COVID, more granular visit concept IDs were added, such as certain types of lab measurements, ICU and intubated patients, and telehealth. These improvements will continue to be implemented in all future research studies.

Getting the word out

The team had leveraged scalable cloud technologies to convert all of Stanford’s EHR data into the OMOP common data format and made it available via Nero, a HIPAA-compliant Big Data Computing Platform. However, this meant changes to users’ workflows. Team members decided early on that they were not just creating a product but a movement: a more modern way of performing Clinical Data Science at Stanford. To spread the word, the team conducted a series of small, in-person workshops (virtual since COVID-19 hit), provided resource materials, and made online tutorials available on topics such as clinical notes and structured data. Open office hours were held biweekly to answer any questions.

These efforts paid off. At launch, approximately 15 researchers immediately took advantage of the OMOP CDW. The number of users has grown rapidly to about 200 researchers today.

Lessons Learned

  • Hospital workflow is complex, so it is crucial to understand the typical patient/clinic workflow and the order of patient interactions to understand what is recorded in certain tables.
  • Faculty input regarding their needs and interests is invaluable during the development process.
  • Fostering a good working relationship with your hospital is essential.
  • The University of Rochester Center for Leading Innovation & Collaboration (CLIC) is an invaluable partner in the drive toward the OMOP common data model as part of the Informatics Common Metric. Through CLIC, Stanford gained access to what other institutions were doing in this area and was able to engage in results comparisons with them.

Stanford University’s reference materials are available here. 

Register for the Data Quality webinar on Thursday, June 9th at 2:00 Eastern time to hear how Stanford University improved the quality of its research data and increased the frequency of its data refresh.