The NIH Common Fund Data Ecosystem

The Common Fund supports a number of Data Coordinating Centers (DCCs), such as the Kids First Data Resource Center, that provide curated data derived from hundreds of studies and samples collected from thousands of human subjects. An incredible diversity of data types has been generated at the genomic, expression, proteomic, metagenomic, and imaging levels, and the DCCs support a tremendous range of scientific discovery efforts.

However, clinical and biomedical researchers are currently poorly served when they try to use the resources generated by the Common Fund. It is difficult to search across all of the Common Fund data sets, and the resources are not readily usable in combination. The individual DCCs also need support for enhanced access to protected data, long-term data storage, training, interconnection with flexible data analysis platforms, and continued availability of data and data portals past the end of the Common Fund Program lifecycle.

The Common Fund Data Ecosystem (CFDE) was established in early 2019 to address challenges faced by end users as well as the DCCs themselves. To assist the Common Fund DCCs, the CFDE supports individual DCC needs with targeted investments in interoperability, authentication/access to protected data, training, program lifecycle support, and evaluation of practical barriers to data Findability, Accessibility, Interoperability, and Reusability (FAIR). The CFDE also coordinates a monthly virtual “cross-pollination” seminar to connect DCCs across the Common Fund and beyond.

A key investment by the CFDE is in cross-DCC data discovery. Each of the DCCs hosts many assets (data files such as genomic sequence, metagenomic data, RNA-seq, and physiological and metabolic data), and it is hard to discover these assets across DCCs. Moreover, information describing the contents of the files is not available in a standardized format. This prevents DCCs from making use of each other’s data, makes the data less discoverable by others, and hinders interoperability. To improve federation, the CFDE has created a central portal with a collection of inventories derived from data that are being hosted by the DCCs. The portal is still under development, but it will eventually describe all the assets at each DCC and make them discoverable via this centralized interface.

The advantage of this approach is that forming the ecosystem does not require that the data assets themselves be available via a central repository: only the inventories describing those assets are centralized. Cataloging all of the Common Fund assets is a simple and effective means of liberating data from what would otherwise be many siloed repositories, and it therefore greatly increases the FAIRness of all Common Fund data. This form of data federation can also be extended to programs funded by other institutes, and easily linked to other NIH ecosystems: once an inventory system is available, it can be used by anyone.
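The inventory-only approach to federation can be illustrated with a minimal sketch. This is not the CFDE portal's actual data model or API: the class names, field names, DCC identifiers, and example.org locations below are all hypothetical, chosen only to show that a central catalog can answer cross-DCC queries while holding nothing but descriptive records that point back to files hosted by each DCC.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AssetRecord:
    """Metadata describing one file; the file itself stays at the DCC."""
    dcc: str        # hosting DCC (hypothetical identifier)
    asset_id: str   # identifier assigned by the hosting DCC
    data_type: str  # e.g. "RNA-seq"
    location: str   # where the DCC serves the file from (hypothetical URL)


@dataclass
class CentralCatalog:
    """Holds inventories only, never the underlying data files."""
    records: list[AssetRecord] = field(default_factory=list)

    def ingest(self, inventory: list[AssetRecord]) -> None:
        """Add one DCC's inventory to the central catalog."""
        self.records.extend(inventory)

    def search(self, data_type: str) -> list[AssetRecord]:
        """Return matching records from every DCC, ignoring case."""
        wanted = data_type.lower()
        return [r for r in self.records if r.data_type.lower() == wanted]


# Two hypothetical DCCs submit inventories describing files they host.
catalog = CentralCatalog()
catalog.ingest([
    AssetRecord("dcc-a", "a-001", "RNA-seq", "https://dcc-a.example.org/files/a-001"),
    AssetRecord("dcc-a", "a-002", "imaging", "https://dcc-a.example.org/files/a-002"),
])
catalog.ingest([
    AssetRecord("dcc-b", "b-001", "RNA-seq", "https://dcc-b.example.org/files/b-001"),
])

# One query spans both DCCs; each hit points back to the hosting DCC.
hits = catalog.search("rna-seq")
for record in hits:
    print(record.dcc, record.asset_id, record.location)
```

In this sketch, adding another program to the federation amounts to one more `ingest` call with that program's inventory; no data files are moved or copied.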

The CFDE is also working with Seven Bridges Genomics to connect the portal to their Cavatica platform in order to support custom data analysis workflows. Cavatica is a Seven Bridges product that provides a user-friendly interface for beginner- and intermediate-level users to conduct bioinformatics analysis with Kids First data. Its graphical user interface lets users access Kids First data or import files, then build customizable analysis workflows in a point-and-click visual editor. The Cavatica workbench is designed for clinicians and non-bioinformatics researchers who may not be well versed in the command line or in programming. For more advanced users with programming experience, Cavatica also offers the ability to construct new tools and pipelines.

Developers at Cavatica are currently funded under the auspices of the CFDE to tie their interface directly to the CFDE portal. An initial implementation of this system is expected by the end of 2021. It is being designed to enable users to create shopping-cart lists of data from the Common Fund DCCs, import those files into the Cavatica workbench, and perform analyses there.

The CFDE is also building a training program in partnership with Kids First and the other Common Fund DCCs to enable end users to make use of the Common Fund data sets and thereby accelerate basic and clinical research. This training program, available at https://training.nih-cfde.org/, will support a wide range of users with guides to using CFDE technologies as well as the resources of specific DCCs. Our existing training includes a CFDE portal guide as well as information on how to use the Kids First portal, and will soon be expanded to include data analysis on Cavatica.