Vision for a Sustainable Data Ecosystem: Connecting the Dots

The CTSA Program strives to deliver scientific and system changes that solve the many outstanding problems limiting the efficiency, effectiveness, and reach of clinical translational research, and thus get more treatments to more patients more quickly across the country. With the CTSA Program being the largest clinical translational science network and an exemplar of team science, NCATS envisions it becoming a standards-based, interoperable, cloud-based network where informatics assets like data, software, and algorithms can be co-developed and shared across the consortium.

This type of sustainable data ecosystem can increase the quality of clinical research by expanding access to different types of clinical data from different sources, all harmonized and linked to provide a more comprehensive clinical picture. Using innovative informatics solutions to address widely recognized systemic barriers, such as limited data interoperability and the difficulty of using Electronic Health Records (EHRs) for conducting research and improving health, is one way the CTSA Program can advance clinical and translational science to improve human health.

Historically, EHRs arose independently of one another, each with its own idiosyncrasies. They were developed primarily for billing; health care was a distant second in priority, and health research was all but ignored. As a result, data from one EHR were not easily aggregated with data from another, since even designations such as male versus female could be labeled differently. While the idealized solution was for everyone to reach consensus on a common standard, historical legacy and institutional inertia have precluded that seemingly obvious advance. Instead, as computers have become more powerful and software more capable, letting the digital realm handle the harmonization has come to the forefront.
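As a rough illustration of what software-driven harmonization can look like for the male versus female example, here is a minimal Python sketch. The site names, the code tables, and the `harmonize_sex` helper are all hypothetical and invented for this example; they do not describe any particular EHR system or the method used by any CTSA hub.

```python
# Hypothetical per-site code tables (assumptions for illustration only);
# real EHR exports use their own vocabularies and will differ.
SEX_CODE_MAPS = {
    "site_a": {"M": "male", "F": "female"},
    "site_b": {"1": "male", "2": "female"},
    "site_c": {"Male": "male", "Female": "female"},
}


def harmonize_sex(site, raw_value):
    """Map a site-specific sex code to a shared label, or 'unknown'."""
    site_map = SEX_CODE_MAPS.get(site, {})
    # Normalize to a stripped string so 2 and " 2 " both match "2".
    return site_map.get(str(raw_value).strip(), "unknown")


# Records as (site, raw value) pairs, each using its site's own labels.
records = [
    ("site_a", "M"),
    ("site_b", 2),
    ("site_c", "Female"),
    ("site_b", "9"),  # a code this sketch has no mapping for
]

harmonized = [harmonize_sex(site, value) for site, value in records]
print(harmonized)  # ['male', 'female', 'female', 'unknown']
```

Mapping unrecognized codes to an explicit "unknown" value, rather than dropping the record, keeps the unmapped data visible so it can be reviewed later.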

Besides EHRs, there is an abundance of other types of data (imaging, omics, waveforms, mobile, time-based data, etc.) that could be used for research if they were to become interoperable and connected. These data can come from different sources such as clinical databases, research datasets, sensors, mobile technology, patient-generated data, and publicly available datasets.

To realize the potential of connecting these “Dots,” hubs are encouraged to embrace a culture of Open Science and Data Sharing that adheres to the FAIR (findable, accessible, interoperable, reusable) principles (see: NIH Strategic Plan for Data Science). Indeed, over the past two years the CTSA hubs have developed a Common Metric for Informatics to increase interoperability of EHR data across the CTSA consortium, and this summer CLIC will start to highlight, through the Insights to Inspire webinar and blog series, the innovative and unique strategies hubs have implemented to improve their performance on this metric.

The National COVID Cohort Collaborative (N3C) represents the pinnacle of data harmonization: it aggregates a large dataset constructed from EHRs representing multiple distinct data models. It has shown the power of the CTSA hubs to practice team science and to speed discoveries by addressing research questions that could not be addressed before its inception. This centralized, harmonized, high-granularity EHR repository is the largest, most representative U.S. cohort of COVID-19 cases and controls to date (>600k COVID cases and still growing) and represents a unique accomplishment by NCATS and the CTSA Program towards addressing the greatest pandemic to strike the planet in >100 years. Beyond what we will learn about COVID (and with >100 projects involving >1200 investigators from >300 institutions, there is a lot to learn), N3C is also teaching us much about data interoperability and harmonization, along with the scientific questions we can now address by building large datasets and combining different types of data.

Ultimately, these enhanced abilities to pool, link, and share data across the CTSAs offer the potential for informatics to drive the evolution of how clinical research is conducted, allowing for better-informed clinical trial designs and permitting new research questions to be addressed.

Data! Data! Data! … I can’t make bricks without clay!

From: The Adventure of the Copper Beeches (1892)

The temptation to form premature theories upon insufficient data is the bane of our profession.

From: The Valley of Fear (1914)

– Sir Arthur Conan Doyle