Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies.

Virtually all genome sequencing efforts in national biobanks, complex and
Mendelian disease programs, and medical genetic initiatives rely on
short-read whole-genome sequencing (srWGS), which presents challenges
for the detection of structural variants (SVs) relative to emerging
long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in
large-scale genomics initiatives, we sought to establish expectations for
routine SV detection from this data type by comparison with lrWGS assembly,
as well as to quantify the genomic properties and added value of SVs
uniquely accessible to each technology. Analyses from the Human Genome
Structural Variation Consortium (HGSVC) of three families captured ~11,000
SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly.
Detection power and precision for SV discovery varied dramatically by
genomic context and variant class: 9.7% of the current GRCh38 reference is
defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of
deletions that were specifically discovered by lrWGS localized to these
regions. Across the remaining 90.3% of reference sequence, we observed
extremely high (93.8%) concordance between technologies for deletions in
these datasets. In contrast, lrWGS was superior for detection of insertions
across all genomic contexts. Given that non-SD/SR sequences encompass 95.9%
of currently annotated disease-associated exons, improved sensitivity from
lrWGS to discover novel pathogenic deletions in these currently
interpretable genomic regions is likely to be incremental. However, these
analyses highlight the considerable added value of assembly-based lrWGS to
create new catalogs of insertions and transposable elements, as well as
disease-associated repeat expansions in genomic sequences that were
previously recalcitrant to routine assessment.
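
The contrast between the share of the reference occupied by SD/SR sequence (9.7%) and the share of lrWGS-specific deletions found there (91.4%) can be expressed as an enrichment. The sketch below is illustrative arithmetic only, not the authors' analysis: the function names are hypothetical, and the only inputs are the two percentages quoted above.

```python
def fold_enrichment(variant_fraction: float, genome_fraction: float) -> float:
    """Ratio of the observed share of variants in a region to the share
    expected if variants were spread uniformly across the reference."""
    if not (0 < genome_fraction < 1 and 0 <= variant_fraction <= 1):
        raise ValueError("fractions must lie in [0, 1] (genome_fraction strictly inside)")
    return variant_fraction / genome_fraction


def density_ratio(variant_fraction: float, genome_fraction: float) -> float:
    """Per-base variant density inside the region divided by the density
    outside it (hypothetical helper; assumes some variants fall outside)."""
    inside = variant_fraction / genome_fraction
    outside = (1 - variant_fraction) / (1 - genome_fraction)
    return inside / outside


# Values quoted in the abstract (assumed here to be exact fractions).
SD_SR_GENOME_FRACTION = 0.097    # 9.7% of GRCh38 is SD or SR
SD_SR_DELETION_FRACTION = 0.914  # 91.4% of lrWGS-specific deletions

if __name__ == "__main__":
    fe = fold_enrichment(SD_SR_DELETION_FRACTION, SD_SR_GENOME_FRACTION)
    dr = density_ratio(SD_SR_DELETION_FRACTION, SD_SR_GENOME_FRACTION)
    print(f"fold enrichment in SD/SR: {fe:.1f}x")
    print(f"per-base density, SD/SR vs. remainder: {dr:.0f}x")
```

Under these assumptions, lrWGS-specific deletions are roughly nine times more common in SD/SR sequence than a uniform distribution would predict, and their per-base density there is about two orders of magnitude higher than in the remaining 90.3% of the reference.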
