New Data Resources Added to the Kids First Data Resource Portal

The NIH Common Fund-supported Gabriella Miller Kids First Data Resource Center (Kids First DRC) is a dynamic, ever-growing effort to create opportunities for investigators across the research and healthcare landscape. It aims to connect researchers all over the world so they can collaborate and share resources toward a better understanding of the genetic causes of, and links between, childhood cancer and structural birth defects.

The Kids First DRC has presented new resources and research studies empowered by Kids First data at the Kids First Fall Webinar and at the annual meeting of the American Society of Human Genetics (ASHG). The goals were to raise awareness of the potential for scientific discovery that is within reach for individuals at any career level, and of the many opportunities for collaboration that result from Kids First partnerships across the NIH and with investigators internationally.

You can read more about the Kids First DRC’s involvement at the ASHG 2022 Annual Meeting and Kids First Fall Webinar on our blog.

Several data expansions and functional improvements have been implemented within the Kids First Data Resource Portal, further expanding our collective capability for research breakthroughs on behalf of children everywhere.

New Data Released on the Kids First Portal

In September, the Kids First DRC released data from the Kids First study on Cornelia de Lange Syndrome (CdLS), led by Principal Investigator Dr. Ian Krantz of the Children’s Hospital of Philadelphia. CdLS is a developmental disorder often characterized by short stature, intellectual disability, and abnormalities in the bones of the arms, hands, and fingers. Specimens collected for this study were primarily sequenced at the Broad Institute, resulting in more than 7 TB of whole genome sequence (WGS) data available for analysis. These WGS data were derived from 373 study participants across 170 families, and include aligned reads, gVCFs, variant calls, and more. Family compositions within this study include trios, duos, and proband-only.

Nearly a year ago, initial data from the Kids First study on T-Cell Acute Lymphoblastic Leukemia (T-Cell ALL) were added to the portal. Led by Dr. David Teachey of the Children’s Hospital of Philadelphia, the study generated nearly 68 TB of whole genome sequence data. In October 2022, substantial new data were added to the dataset associated with this study. The number of data files expanded more than 10-fold as the output from the Kids First Somatic Variant Workflow was added, bringing the total to 24,254 files. In addition, 35 TB of RNA sequences and whole exome sequences (WXS) were generated. The T-Cell ALL dataset now contains over 259 TB of data, which include aligned reads, annotated and masked somatic mutations, gene fusions, and variant calls from both tumor and normal samples.

Also in October, the Kids First study on Bladder Exstrophy-Epispadias Complex (BEEC), led by Dr. Angie Jellin of Johns Hopkins University, was released. Samples were sequenced at the Broad Institute, resulting in more than 9 TB of WGS data generated from 321 study participants across 134 families. The dataset contains aligned reads, variant calls, and gVCFs, derived from trio, duo, and proband-only family groupings.

New Analysis Workflows to Accelerate Scientific Discovery

With the addition of new data resources, the Kids First Data Resource Portal has grown to encompass more than 1.7 petabytes of multi-omic data to support research into conditions affecting children everywhere. However, this figure illustrates only a portion of the power of this data resource. Paired with each of the datasets within the Kids First portal is a suite of analysis tools and features to enable discoveries faster than before. Built into the scaffolding of each dataset is a range of workflows (essentially, pre-programmed algorithms that automate analysis across numerous research methodologies) to assist in the interrogation of disease from multiple angles.

Data added to the Kids First T-Cell ALL study were generated by Kids First data experts using the Somatic Variant Workflow, making the detection of variations within a patient’s genome much less burdensome. This set of software tools pinpoints genes within the patients’ leukocytes that have changed or malfunctioned, compared to the normal gene code found in their healthy white blood cells. This in turn enables researchers to shed new light on the possible root causes of tumor development in patients, as well as new gene targets for therapeutic intervention.
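
The tumor-versus-normal comparison described above can be illustrated with a minimal sketch. This is not the Kids First Somatic Variant Workflow itself; it only shows the underlying idea under a simplifying assumption, namely that each sample has already been reduced to a set of variant records. The record layout, the function name `somatic_candidates`, and the example values are all hypothetical.

```python
# Hypothetical variant record: (chromosome, position, reference allele, alternate allele).
# Real workflows operate on VCF files and apply many quality filters; this sketch does not.
Variant = tuple[str, int, str, str]


def somatic_candidates(tumor: set[Variant], normal: set[Variant]) -> set[Variant]:
    """Return variants seen in the tumor sample but absent from the matched normal sample."""
    return tumor - normal


# Made-up example calls for one patient (illustrative values only).
tumor_calls = {
    ("chr1", 1000, "A", "T"),   # also present in the normal sample, so likely inherited
    ("chr9", 2500, "G", "C"),   # tumor-only
    ("chr17", 7000, "C", "T"),  # tumor-only
}
normal_calls = {
    ("chr1", 1000, "A", "T"),
}

candidates = somatic_candidates(tumor_calls, normal_calls)
for variant in sorted(candidates):
    print(variant)
```

Variants shared by both samples are treated as inherited and dropped, leaving only the changes that arose in the tumor cells.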

Data from the T-Cell ALL study were also generated by the RNA Sequencing Workflow, which allows users to see the amount of “expression” of every gene in the genome. This is yet another layer through which researchers can determine factors that cause normal cells to turn cancerous. Through this workflow, researchers can detect if specific sections of the genome are being copied correctly, too much, or not enough during the process of cell division, as well as whether sections of the genetic code become fused together during tumor development.

Data from the CdLS and BEEC studies were generated using the Joint Genotyping Workflow, which allows researchers to compare genetic variants across families in trio groupings, that is, sequence data from an individual patient and both of their biological parents. This allows researchers to identify new (de novo, or spontaneous) variants and assists in tracking gene inheritance within a family.
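
As a rough illustration of how a trio helps flag new variants, the sketch below checks whether a child carries an alternate allele that neither parent carries. It is a simplified assumption-laden example, not the Joint Genotyping Workflow: genotypes are assumed to be VCF-style strings such as "0/0" or "0/1", the function name `is_de_novo_candidate` and the site labels are hypothetical, and no quality or depth filtering is applied.

```python
def is_de_novo_candidate(child: str, mother: str, father: str) -> bool:
    """Flag a site where the child has an alternate allele and both parents are homozygous reference.

    Genotypes are VCF-style strings: "0" is the reference allele, "1" an alternate allele,
    and "." a missing call. Sites with a missing parental call are not flagged.
    """
    def alleles(genotype: str) -> set[str]:
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        return set(genotype.replace("|", "/").split("/"))

    child_has_alt = "1" in alleles(child)
    parents_are_ref = alleles(mother) == {"0"} and alleles(father) == {"0"}
    return child_has_alt and parents_are_ref


# Made-up genotypes per site, ordered as (child, mother, father).
trio_genotypes = {
    "site_A": ("0/1", "0/0", "0/0"),  # child-only variant: de novo candidate
    "site_B": ("0/1", "0/1", "0/0"),  # inherited from the mother
    "site_C": ("0/1", "./.", "0/0"),  # mother's call is missing: cannot conclude
    "site_D": ("0/0", "0/0", "0/0"),  # no variant in the child
}

de_novo_sites = [site for site, gts in trio_genotypes.items() if is_de_novo_candidate(*gts)]
print(de_novo_sites)
```

Duo and proband-only families cannot support this check in full, because at least one parental genotype is unavailable.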

Finally, each of the studies highlighted here also utilizes the Kids First Alignment Workflow, a feature applied to all studies within the Kids First Data Resource Portal to assist with the detection of germline variants. Used to examine normal cells collected from a patient’s blood and saliva, this workflow allows researchers to examine a child’s entire genome in one continuous sequence, and to compare it against the Human Reference Genome to detect anomalies. This, in turn, allows researchers to detect differences in a child’s genome that could point to disease predisposition as well as to possible gene targets for new therapies.

Launching New Discoveries with the Power of Cloud Computing

The Kids First Data Resource Center represents a wealth of opportunities for researchers at any level, working anywhere in the world. Each of the workflows described above is publicly available for researchers to use on datasets of their own, allowing them to combine their own genomic data with that of the Kids First Data Resource Center. And now, through the Kids First Cloud Credits program, the ability to take full advantage of Kids First resources is easier to access than ever!

The Cloud Credits program enables researchers to conduct cloud-based analyses by accessing and utilizing the data and tools available through the Kids First Data Resource. By providing user credits at no cost to the individual, the program enables more researchers than ever to harness the scalable cloud computing power of Kids First’s analysis platform, CAVATICA, compressing analysis time from months to days. The platform also enables researchers to import their own data, to be analyzed through the Kids First Data Resource Center’s optimized bioinformatics workflows.

The Kids First Cloud Credits program is available to all researchers interested in using Kids First data resources. For more details, read the complete announcement here.