The Kids First Long-Read Sequencing Pilot Program: Applying New Technologies in Studying Childhood Cancer and Structural Birth Defects
Kiran Garimella, DPhil; Broad Institute
Shawn Levy, PhD; HudsonAlpha Institute for Biotechnology
The human genome is complex, and very big. According to the National Human Genome Research Institute, a single chromosome can range in size from 50 to 300 million base pairs (the paired DNA bases, always A with T or C with G, that join the two twisting strands). The entire human genome contains over 3 billion base pairs.
The ability to decode and examine the full genetic information that drives the behavior of cells is incredibly important to medical science, and was made possible nearly 2 decades ago with the completion of the Human Reference Genome, a standardized reference and aggregate of all the genomic information and diversity found in human genes. When studying cancers, birth defects, and rare diseases, scientists can compare gene sequences from an individual against this standardized reference genome to detect variation or changes in the DNA sequence, which could then point to possible causes of these ailments and disorders.
It’s not yet feasible to generate a single, continuous sequence (or “read”) of a person’s entire genome. Depending on the sequencing center and the technology being used, the standard for research is to conduct “short-reads” of around 100-150 base pairs at a time, and then to computationally stitch these together to form a whole genome sequence (WGS).
The short-reads are produced in a random order and need to be assembled back into a genome in order to try to identify the differences of interest to researchers. This is similar to completing a jigsaw puzzle. One needs to put all the pieces into place in order to see the whole picture. But with short-read sequencing, that puzzle can have hundreds of millions of pieces to assemble.
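To get a feel for the size of that puzzle, the arithmetic can be sketched in a few lines of Python. This is a rough estimate only: the genome size and read length come from the text above, while the 30x coverage (the average number of times each base is read) is an assumed, illustrative value, and `reads_needed` is a hypothetical helper rather than part of any sequencing toolkit.

```python
import math


def reads_needed(genome_bp: int, read_len_bp: int, coverage: int) -> int:
    """Estimate how many reads are needed so that each base of the
    genome is covered `coverage` times on average."""
    return math.ceil(genome_bp * coverage / read_len_bp)


GENOME_BP = 3_000_000_000  # roughly 3 billion base pairs, as stated in the text
READ_LEN_BP = 150          # upper end of the 100-150 bp short-read range
COVERAGE = 30              # ASSUMPTION: illustrative depth, not from the article

pieces = reads_needed(GENOME_BP, READ_LEN_BP, COVERAGE)
print(f"Approximate number of short-reads to assemble: {pieces:,}")
```

Under these assumptions the estimate comes to 600 million reads, consistent with the "hundreds of millions of pieces" described above.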
Despite this challenge, short-read sequencing has been the most readily available and cost-effective way to examine genomic data for biomedical research. In the roughly 15 years that short-read sequencing has been available, yield and accuracy have steadily improved. Sequencing costs have fallen tremendously, enabling larger scale biological investigations. Greater public availability of data has enabled scientists to develop new tools that can extract ever-increasing value from the data. Short-read sequencing is the standard method for gene sequencing because it has definitively established itself as a flexible tool for many jobs.
But that doesn’t necessarily mean it’s the right tool for every job.
A fundamental limitation of short-reads is that they’re…short, and thus have limited capabilities. They perform poorly at detecting large changes in DNA sequence (so-called structural variants, including deletions, duplications, inversions, and translocations, which tend to be longer than the short-reads themselves) and at detecting variation in repetitive regions, which make the short-read puzzle much more difficult to assemble.
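The trouble with repetitive regions can be illustrated with a toy example. The sketch below uses a made-up 28-base "genome" containing a short repeat, and a hypothetical helper, `find_placements`, that lists every position where a read matches exactly. Real aligners tolerate mismatches and work on vastly larger sequences, so this is only an illustration of the ambiguity, not of how alignment software works.

```python
def find_placements(genome: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `genome` exactly,
    including overlapping matches."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy genome (invented for illustration): unique flank + repeat + unique flank.
repeat = "ACG" * 6
genome = "TTGCA" + repeat + "GGATC"

# A read that falls entirely inside the repeat fits in several places.
short_read = "ACGACG"
# A read long enough to span the repeat plus some unique sequence on each side.
long_read = "GCA" + repeat + "GGA"

print("short read placements:", find_placements(genome, short_read))
print("long read placements: ", find_placements(genome, long_read))
```

The short read matches at five different positions, so there is no way to tell where it came from or how many copies of the repeat are present. The read that spans the whole repeat and reaches unique sequence on both sides matches in exactly one place.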
Long-reads are produced by special types of sequencers. The long-reads are many thousands of base pairs long and can be used to illuminate places in the genome that short-reads can’t access or assemble. They help researchers explore structural variations in the genome and improve assemblies (it’s easier to assemble a puzzle cut into 100 large pieces than one cut into 1,000 small pieces).
Until a few years ago, long-read sequencing was still too expensive, too inaccurate, and too hard to use for regular sequencing of lots of human genomes. But that has recently started to change in a big way.
Long-read sequencing vendors (Pacific Biosciences, or PB, and Oxford Nanopore, or ONT) released versions of their instruments that massively increased long-read yield over previous iterations. Accuracy increased substantially and sequencing costs fell by an order of magnitude. Greater data availability once again enabled the development of novel software methods to capitalize on the improvements and offer new analysis capabilities. In concert, these changes are enabling long-read sequencing data to be generated and processed at the scale required for human patient studies.
Now, under its long-read sequencing pilot program, the NIH Common Fund’s Gabriella Miller Kids First Pediatric Research Program (Kids First) is working to leverage these rapidly advancing technologies to uncover genetic structural variation underlying childhood cancers and structural birth defects.
The two Kids First genome sequencing centers, HudsonAlpha Institute for Biotechnology and Broad Institute, were founded on the premise of bringing the most appropriate and advanced sequencing technologies, with strong support, to the investigators selected in the Kids First program. The long-read pilot program extends the ongoing short-read sequencing efforts to provide Kids First investigators and the wider research community with the best available resources to reveal novel insights or higher resolution data for these unique cohorts and samples. There are currently seven research studies participating in the Kids First long-read pilot program.
One project, led by Principal Investigator Dr. Sharon Plon of Baylor College of Medicine and sequenced by HudsonAlpha, is conducting long-read sequencing of the BASIC3 cohort. The project aims to improve our collective understanding of pediatric cancer susceptibility through the analysis of germline structural variation using long-read whole genome sequencing.
Principal Investigator Dr. Bruce Gelb of the Icahn School of Medicine at Mount Sinai is leading a study involving long-read whole genome sequences generated at the Broad Institute, focused on congenital heart defects (CHD). His team hopes to use long-read sequencing to identify structural variants and gene repetitions that could be contributing factors in the development of CHD in some patients.
Continued advancements in read quality, read length, and cost efficiency in data production have brought us to an interesting turning point. Oxford Nanopore continues to improve its chemistry and read accuracy while maintaining the ability to generate reads that are hundreds of thousands to millions of base pairs long. Pacific Biosciences continues to improve the yield and quality of its platform. And the Circular Consensus Sequencing (CCS) method from Pacific Biosciences can be thought of as combining the length of long-reads with the best feature of traditional short-read sequencing, where read accuracy is improved by detecting the same sequence multiple times. CCS allows thousands of base pairs to be detected in a single read, but the same molecule is read 7 to 12 times in replicate, yielding a highly accurate consensus sequence for that segment.
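The idea behind reading the same segment repeatedly can be sketched with a simple majority vote. The example below is a deliberately simplified illustration, not the algorithm Pacific Biosciences uses: it assumes the passes are already aligned, are all the same length, and contain only substitution errors. The sequences are invented for the example, and `consensus` is a hypothetical helper.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Build a consensus by taking the most common base at each position.
    ASSUMPTION: all passes are pre-aligned and of equal length."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Invented 10-base segment and seven noisy passes over it
# (seven is the low end of the 7-12 range mentioned in the text).
truth = "ACGTTGCAAC"
passes = [
    "ACGTTGCAAC",  # no errors
    "ACCTTGCAAC",  # error at position 2
    "ACGTTGGAAC",  # error at position 6
    "ACGTAGCAAC",  # error at position 4
    "ACGTTGCAAC",  # no errors
    "TCGTTGCAAC",  # error at position 0
    "ACGTTGCATC",  # error at position 8
]

result = consensus(passes)
print("consensus:    ", result)
print("matches truth:", result == truth)
```

Five of the seven passes contain an error, yet because the errors fall at different positions, each one is outvoted by the other six passes and the consensus recovers the original segment. Real consensus calling also has to cope with inserted and deleted bases, which this sketch ignores.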
One of the challenges associated with long-read technologies is that they each require very high-quality DNA samples. Before being loaded onto a sequencer, the DNA molecules must be extracted from cells, a procedure of physical and chemical disruptions that can also break the delicate chains of nucleotides. Too rough an extraction could tear the DNA into tiny pieces not suitable for long-read sequencing. The Kids First Data Resource Center (DRC) and its sequencing center partners are working with the researchers contributing samples to be sure these are of sufficient quality to maximize the benefits of long-read technology.
The Gabriella Miller Kids First Program’s commitment to supporting long-read sequencing is helping to bring focus to areas of the genome or variant classes that may be under-reported or undetectable with a single technology. When combined, long-read and short-read efforts will complement each other to advance discoveries of pathways underlying these pediatric conditions. It is important to note that neither technology will replace or fully substitute for the other. This is an opportunity to add more tools to the investigators’ toolbox.