PMWC 2018: Leveraging Multi-Omic Datasets in Discovery & Clinical Trials

The Precision Medicine World Conference kicks off next week at the Computer History Museum in Mountain View, California. The program traverses innovative technologies, thriving initiatives, and clinical case studies that enable the translation of precision medicine into direct improvements.

Please join us for a lively panel discussion around scalable infrastructure/platforms that integrate next-generation sequencing (NGS) and other data (e.g. phenotypic) to power discovery in Pharma and the clinic.

Title: Scalable NGS Infrastructure/Platforms
Talk Details: Track 1 – Monday, January 22 at 10:30am
Moderator: Brady Davis, Chief Strategy Officer, DNAnexus

Panel Speakers:

  • AstraZeneca/MedImmune – David Fenstermacher, VP BioInformatics
  • Sutter Health – Greg Tranah,  Director, CPMC Research Institute, Adjunct Professor Dept. of Epidemiology & Biostatistics, UCSF
  • Carol Franc Buck Breast Cancer Center at UCSF– Laura Esserman, Director
  • City of Hope – Sorena Nadaf, SVP & CIO

Abstract:

Health care providers increasingly require multi-omic datasets, including phenotypic data informed by genomic data. Such data needs to be obtained in an economically sustainable way and made available on an agile user-friendly platform so that these data may inform clinical care and lead to health improvements.

Pharmaceutical companies (“Pharmas”) are interested in obtaining datasets containing phenotypic/clinical and genomic information generated from patient cohorts of specific disease areas. Such datasets can help Pharma researchers identify drug targets or find biomarkers, validate hypotheses related to the interaction of genomics with disease or with specific therapies, and identify candidate populations for future clinical trials. Payers are also interested in the outcomes related to new discoveries and therapies in order to reimburse for these treatments.

This session will focus on how both healthcare provider organizations, Pharmas and Payers are working toward solving these complex and challenging problems from a technical and business model perspective.

 

Evaluating the Performance of NGS Pipelines on Noisy WGS Data

In this blog post, we explore variability in the quality of sequencing runs. We present a method which allows the generation of benchmarking sets of real sequence data across arbitrary quality ranges. Finally, we present benchmarks of popular variant calling pipelines across these diverse quality sets.

Introduction

The recent generation of gold-standard genome datasets – such as the Genome in a Bottle (GIAB) and Platinum Genomes – has given greater understanding about the quality of sequencing. Competitions held on PrecisionFDA have driven the innovation and refinement of new methods. Sentieon, Edico, and DeepVariant, each a winner of one or more of Consistency, Truth, and Hidden Treasures Challenges, have proven the power of standards to drive innovation. These gold-standard datasets used to construct truth sets and challenges were produced with great care and attention to quality.

We wondered how directly these sets reflect common, real-world sequencing data. “Quality” of sequencing is complex – factors such as library complexity, coverage, contamination, index hopping, PCR errors, and insert size distribution are all important. For our exploration, we use a quantifiable, unambiguous, and straightforward measure – the sequencer-reported base quality.

FASTQ base qualities estimate the probability that a base call (A, T, G, or C) is in error. These errors make reads more difficult to correctly map and produce noise that variant callers must contend with. Monitoring of base qualities is often a component in QC.

We score each pair of forward and reverse reads with the expected value for the number of errors present in the pair. The combination of these values for each read pair in a sequencing run creates a distribution with a mean and standard deviation for errors.

Comparing GIAB, PrecisionFDA Sets to General Data

First, we investigate the Illumina HiSeq2500 contributed to GIAB. These data, along with contributions of PacBio, Complete, 10X, BioNano, Oxford Nanopore, and Ion Torrent were used to create the GIAB truth sets.  

As Figure 1 indicates, there is some variability in quality – HG003 samples have about 30% less of an error rate than HG001. However, overall base accuracy is high.

We selected 3 readily available HiSeq2000/HiSeq2500 sets from SRA and performed the same calculation.  As Figure 2 illustrates, other datasets range from modestly higher (~3x) to dramatically higher (~20x) error rates. The blue dots in this figure are the same as all of the dots in Figure 1.


We applied a similar investigation to datasets from the HiSeqX. In Figure 3, we compare predicted error rates for the 2016 Garvan WGS dataset contributed to the PrecisionFDA Truth challenge, several HiSeqX replicates produced by Illumina and available on BaseSpace, several HiSeqX runs performed by Academic Medical Centers, and the first publicly available 2014 HiSeqX run from Garvan that we noted as difficult for Indel calling in a prior blog post.

With these HiSeqX sets, there is a 5-fold variability in predicted error rate between the highest and the lowest samples.  How do differences of this magnitude relate to the quality of variant calls from a sample? Answering this question is not easy since the diversity of datasets are not on the genomes (HG001, HG002, or HG005) that have truth sets for. To address this, we have developed a method called Readshift, which allows us to resample existing runs to create real datasets of diverse read quality.

Readshift

Readshift uses biased downsampling from a dataset of larger coverage to create a shifted quality distribution at lower coverage. (See Figure 4 for a conceptual example.)

It does so by modeling the original and final samples as normally distributed and calculating what the probability for inclusion of each read would be required to reach the desired distribution.

Readshift does not force the final distribution to be normally distributed. Rather, it acts as a transformation which allows shifting of a distribution in a way that preserves the shape of the original distribution.

Importantly, the data produced by Readshift is not simulated – these are real reads produced by a real sequencer and thus carry all of the intricacies, biases, and idiosyncrasies associated with real data. No individual base or quality value is altered.

All of the code for Readshift is released with an open source license on Github  – https://github.com/dnanexus/Readshift. The FASTQ data in this investigation are available in a public DNAnexus project for download or general use, as are the Readshift apps.

Shifting HiSeq2500, HiSeqX, and NovaSeq Datasets

Several high-coverage datasets exist for HG001 on the HiSeq2500, HiSeqX, and NovaSeq platforms. We took 200X HiSeq2500 data submitted for GIAB, 200X HiSeqX data available on BaseSpace, and 350X Novaseq data available on BaseSpace and applied Readshift to increase the predicted errors in reads by 0.5, 1.0, 1.5, and 2.0 standard deviations relative to the prior mean. To confirm that no artifacts are introduced by the shift machinery, we also applied Readshift with a shift of +0.0 standard deviations.

This range roughly covers the difference between the best and worst HiSeqX sets shown above. The magnitude of the possible shift is determined by the amount of starting coverage – higher coverage allows larger shifts without clipping the distribution.

Figure 5 shows the distribution of read errors after the shift. Though we use the probability density function of the normal to do the transformation, shifting qualitatively preserves the distribution shape. Both original and final distributions are left-leaning curves. At the higher degrees of shift, much of the decline in read quality appears driven by reads with a strong decline in quality toward the 3’ end, or reads with a bad mate.

Evaluating WGS Pipelines on Noisy Datasets

We evaluate several popular WGS variant calling pipelines on this shifted data starting from 35X coverage FASTQ files. All analyses are done against the hg38 reference genome. Accuracy is assessed on the confident regions using the Genome in a Bottle Truth Sets, with the evaluation performed using the same method that we use in PrecisionFDA.

Figure 6 demonstrates that tools behave differently when used to call increasingly noisy data . Freebayes begins to suffer immediately, while Strelka2 suffers once the shift exceeds 0.5 standard deviations. Even the most resistant programs perform significantly worse (GATK 100%, DeepVariant 68% more errors) at a shift of 2.0 standard deviations. Sentieon avoids downsampling to achieve reproducibility, but otherwise matches GATK3.

The SNP results for HiSeqX follow a similar pattern as indicated in Figure 7.

For NovaSeq (Figure 8), though the data is qualitatively similar to the HiSeq2500 and HiSeqX runs, other complexities occur. Generally, low-quality runs took significantly longer to finish for this data. Some programs either failed or produced results that suggest they did not call several regions for high coverage. The NovaSeq uses 2-color chemistry which does not enable it to indicate an unknown base as N. This may make reads with more errors hard to uniquely map, causing an increase in secondary alignments and regions of unmanageable coverage.

This suggests that filtering of NovaSeq data may be more important than with previous sequencers in low-quality runs. (GATK3+1.0, GATK4+1.0, Strelka+1.5, and DeepVariant+2.0 were impacted).

Indels

The results with Indel calling was different than those for SNPs. Apart from Freebayes, none of the variant callers suffer an accuracy penalty on the lower-quality data.

In the HiSeqX data (Figure 10), the Indel error rate is much higher even in the original data. This is the same phenomenon we observed in the early 2014 Garvan HiSeqX run we found in a prior blog post; however, these are different runs. It is possible that many early HiSeqX runs have this issue, or that some other aspect of the preparation is at fault.

Interestingly, though Indel callers arerobust against these extreme shifts in read quality, they suffer strongly from whatever occurred in the HiSeqX runs.

The NovaSeq data (Figure 11) is qualitatively similar to the HiSeq2500 results.

The fact that the callers exhibit greater robustness against Indel errors could be an opportunity to improve SNP calling. Although the fact that the definition of base quality values map more directly to SNP events than Indels, reads in the higher shifts often have long stretches of low-quality bases toward the 3’ end. Based on this, it seems likely that something in the way tools handle indels mitigates the low-quality reads. This could be a heavier reliance on haplotype construction, Indel realignment, or local assembly to resolve Indels. It could also be that since Indels span multiple bases, there is a greater signal to distinguish random errors from meaningful differences. If this is the case, this ability to gain nearby corroborating evidence might be achievable by phasing nearby variants or looking whether nearby potential errors co-vary.  

Increased Computational Burden on Low Quality Sets

As Figure 12 and Figure 13 indicate, lower quality samples carry a higher computational burden. For BWA, reads become harder to uniquely map, which also causes an increase in secondary alignments.

This problem is more severe for NovaSeq data, perhaps because 2-color chemistry lacks a way to designate a base as “N” (“no color” corresponds to “G”) for NovaSeq.  

For GATK3 HaplotypeCaller, this can cause samples to require 3x as much compute, even at moderate shifts. We believe that GATK3 is affected because secondary align reads pile up in poly-G/poly-C regions of the genome, and this generates a large number of candidate haplotypes to consider.

We think that many programs may suffer similar impacts, particularly any process which considers across candidate events at low confidence (e.g. many possible haplotypes) and selects the best choices. This may be worse for methods in somatic calling, or detection of events in cell-free DNA. Even in pristine reads with a 3/1000 error rate, any given mismatch is 3x more likely to be a sequencing error than the human variation rate. Germline callers manage this because the signal from germline events is consistent. Somatic callers must consider candidates at low frequencies or risk losing rare, but impactful, events. We hope to discuss this and strategies for mitigation in a future post. Note that the strong majority of NovaSeq runs are just fine – and standard read filtering tools seem to work for those that have issues.

Conclusion

By providing this analysis, we hope to highlight the fact that not all sequencing runs are the same and that this is important to their analysis, the choice of tools for them, and how we interpret tool evaluations on gold standard data. By releasing both the code https://github.com/dnanexus/Readshift  to perform the shift and the data (https://platform.dnanexus.com/projects/F9K5zYQ0ZQBb5jJf6X2zZPv1/data/), we aim to help the community improve the accuracy and robustness of their methods. Readshift can be applied to any emerging benchmark as long as high coverage sequencing exists, such as the CHM1-CHM13 dataset developed by Heng Li et al.

This work raises a number of additional areas we have begun to investigate, including the best strategies to filter reads to protect against low-quality sets, the differences in impact on precision versus recall across tools and sequencing platforms, and the potential for ensembling analytical methods to shield against outlier sequencing runs.

If you are interested in collaborating with us on this work, we want to engage scientific partners. If you have sequence datasets on NA12878/HG001, HG002, or HG005, particularly from runs that turned out to be bad, and want to help us use them to make better benchmarking resources: Readshift only gets stronger with more data. Please reach out to us at readshift@dnanexus.com.

Dot: An Interactive Dot Plot Viewer for Comparative Genomics

Author: Maria Nattestad, Scientific Visualization Lead

 

 

 

Introduction

Next week, DNAnexus will be at the Plant and Animal Genome conference (PAG) in San Diego (booth 431). As part of an ongoing effort to expand our visualization capabilities, we will present an open-source tool called Dot that helps scientists visualize genome-genome alignments through a rich, interactive dot plot.

In addition to its scientific contribution, Dot encourages community development of new visualization tools by providing a template that can be used for new visualization tools in other areas of bioinformatics. This would allow bioinformaticians to focus on the bioinformatics and visualization without needing to master web programming intricacies such as reading data from local and remote servers, which is all handled by Dot’s modular and reusable inner workings.

Importance of Dot Plots

Constructing a genome assembly is fundamental to studying the biology of a species. In recent years, advances in long-read sequencing and scaffolding technologies have led to unprecedented quality and quantity of genome assemblies. Better reference genomes contribute to better gene annotations, evolutionary understanding, and biotech opportunities.

Comparing new assemblies to existing genomes of related species is crucial to understanding differences between organisms across the tree of life. Genome assemblies are never perfect and always have to be evaluated critically by comparing against other assemblies or reference genomes, whether of the same or a closely related species. Comparative genomics is also how assemblies of two species’ genomes can be compared and contrasted to look for features that represent functional differences or inform the study of their evolution.

The classic method for visualizing genome-genome alignments is the dot plot, which provides an excellent overview of alignments from the perspective of both genomes. Dot plots place the reference genome on one axis and the query genome (that is aligned against the reference) on the other axis. Alignments between the two genomes are placed according to their coordinates on both genomes. Whereas genome browsers (such as IGV and the UCSC Genome Browser) plot data in one dimension on one genome, dot plots use two dimensions to show alignments in two genomes’ coordinates spaces simultaneously. This is necessary when representing large genome alignment data where the query coordinates matter just as much as the reference coordinates for a particular alignment.

However, dot plots have barely changed in the past decade and are still generated from the command-line as static images, limiting detailed investigation. We decided to tackle this problem as an open-source science project at DNAnexus.

Introducing Dot

Here we present Dot, an interactive dot plot viewer that allows genome scientists to visualize genome-genome alignments in order to evaluate new assemblies and perform exploratory comparative genomics.

Dot supports the output of MUMmer’s nucmer aligner the most commonly used software method for aligning genome assemblies. A quick script called DotPrep.py converts the delta file to a more streamlined coordinates file with an index that enables Dot to read in more alignments in certain regions on demand.

Interactivity and features

Dot adds a number of useful features on top of the classic dot plot concept. The index enables a quick plot of an overview that includes the longest 1000 alignments. From here, users can zoom in to look at particular regions and load all the alignments for regions of interest.

In addition to showing alignments, Dot allows scientists to load annotations for either or both genomes to show additional context  (e.g. understanding how sequence differences map to gene differences). Annotation tracks are a common feature of one-dimensional genome browsers, but to translate this concept to the two-dimensional dot plot, we enable annotation tracks on both axes. This is a major benefit of Dot that makes it possible to compare gene annotations visually alongside the alignments of the DNA sequences.

Moreover, users can jump to the same region of the reference genome in the UCSC Genome Browser to quickly see additional context for a region of interest. This allows scientists to explore how known repetitive elements in the reference genome could potentially affect assembly quality in specific regions.

Details for developers

By leveraging D3 and canvas in JavaScript, Dot combines the benefits of interactivity with scalability, enabling scientists to explore large genomes. The UI on the right side panel is built using an open-source SuperUI.js [https://github.com/marianattestad/superui] plugin, and the input handling and basic page navigation is set up through a special VisToolTemplate [https://github.com/MariaNattestad/VisToolTemplate] plugin we developed to enable others to create new visualization tools more easily. We encourage developers to utilize and build on Dot and these open-source projects to create their own visualization tools. Dot is very modular and can be used as a template to build new visualization tools. The template handles complex and necessary components like reading input data files from various sources, thereby letting developers focus only on the visualization itself.

Dot is open source

Dot is free to use online at [https://dnanexus.github.io/dot/] and open source at [https://github.com/dnanexus/dot]. For DNAnexus users, there is a package available among the featured projects with (1) an applet for running MUMmer’s nucmer aligner that includes DotPrep.py, (2) a shortcut  to Dot to send files from DNAnexus quickly, and (3) example data and results.