Highly Accurate SNP and Indel Calling on PacBio CCS with DeepVariant

Authors: Alexey Kolesnikov, Pi-Chuan Chang, Jason Chin, Andrew Carroll

In this blog we discuss the newly published use of PacBio Circular Consensus Sequencing (CCS) at human genome scale. We demonstrate that DeepVariant trained for this data type achieves similar accuracy to available Illumina genomes, and is the only method to achieve competitive accuracy in Indel calling. Early access to this model is available now by request, and we expect general availability in our next DeepVariant release (v0.8).

Editorial Note: This blog is published with identical content here and on the Google DeepVariant blog. Re-training of DeepVariant and accuracy analyses were performed by Alexey Kolesnikov, Pi-Chuan Chang, and Andrew Carroll from Google. Sequence context error analysis was performed by Jason Chin of DNAnexus.

PacBio Circular Consensus and Illumina Sequencing by Synthesis

The power of a sequencing technology (e.g. accuracy, throughput, and read length) is determined by its underlying biochemistry and physical measurement. Each read in Illumina sequencing by synthesis (SBS) corresponds to clustered copies of the same DNA molecule. The consensus of an SBS cluster provides high accuracy, but molecules in the cluster drift further out of phase as the read gets longer, ultimately limiting read lengths.

The ability of PacBio’s single molecule real-time (SMRT) sequencing to measure a single DNA molecule allows it to escape this limitation on read length (important in applications like genome assembly, structural variation, and in difficult regions). However, sampling a single molecule without a consensus is more error prone, with base error rates of 10-15%.

[Figure: PacBio CCS]

PacBio CCS builds a consensus from repeated measurements of the same molecule. A stretch of DNA with a controlled length (e.g. 15,000 bases) is linked by known adapters. Sequencing this DNA multiple times provides the best of both worlds: a long read length and an accurate measurement, reaching 99% per-base accuracy in a consensus read. This promises a single data type strong for both small variant analysis and structural variation.
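To build intuition for why repeated passes over one molecule help, here is a toy calculation. It is not PacBio's consensus algorithm. It assumes that errors in each pass are independent, that every pass has the same per-base error rate, and that the consensus is a simple majority vote in which all wrong passes agree with each other (a pessimistic simplification). The function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(per_pass_error: float, passes: int) -> float:
    """Probability that a majority of independent passes is wrong at one base.

    Toy model only: assumes independent errors and a plain majority vote.
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use a positive, odd number of passes to avoid ties")
    needed = passes // 2 + 1  # smallest number of wrong passes that wins the vote
    return sum(
        comb(passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # 0.15 is the upper end of the single-pass error range quoted in the text.
    for n in (1, 3, 5, 9, 15):
        print(f"{n:2d} passes -> toy consensus error {majority_error(0.15, n):.6f}")
```

Under these assumptions the consensus error shrinks quickly as the number of passes grows, which is the qualitative effect the text describes.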

Although the PacBio CCS base error rates are low, the sequence context of the errors differs from Illumina’s, and variant callers need modification to perform optimally. Rather than relying on features hand-coded by humans, DeepVariant learns which features are important from the data. This unique attribute allows it to be quickly adapted to PacBio CCS by re-training on this data.

Re-Training DeepVariant for PacBio CCS

Training DeepVariant involves starting from a model checkpoint and showing it labeled examples. This changes the weights of the model over time. We started from a DeepVariant Illumina WGS model and re-trained it with PacBio CCS data, excluding chromosome 20 so that it could be used for independent evaluation. The PacBio CCS reads were generated from HG002, which has a truth set available from Genome in a Bottle that serves as the basis for training. (To learn more, see this walkthrough on training.)
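The chromosome hold-out described above can be sketched in a few lines. This is only an illustration of the idea, not DeepVariant's training code: the `Example` record, its fields, and `split_by_chromosome` are hypothetical names, and the contig label "chr20" is an assumption about how chromosome 20 is named in the data.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Example(NamedTuple):
    """Hypothetical labeled training example keyed by genomic position."""
    chrom: str
    pos: int
    label: int  # hypothetical genotype class


def split_by_chromosome(
    examples: Iterable[Example], holdout: str = "chr20"
) -> Tuple[List[Example], List[Example]]:
    """Send every example on the held-out chromosome to the evaluation set."""
    train: List[Example] = []
    evaluation: List[Example] = []
    for example in examples:
        if example.chrom == holdout:
            evaluation.append(example)
        else:
            train.append(example)
    return train, evaluation


if __name__ == "__main__":
    data = [Example("chr1", 100, 1), Example("chr20", 5, 2), Example("chr2", 7, 0)]
    train, evaluation = split_by_chromosome(data)
    print(len(train), "training examples;", len(evaluation), "held out")
```

Holding out a whole chromosome means no evaluation site was ever seen during training.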

[Figure: SNP Accuracy CCS]

Figures 2A and 2B show the SNP and Indel performance on PacBio CCS for DeepVariant after retraining for CCS, and for GATK4 run with flags and filters chosen by PacBio to improve performance for CCS.

[Figure: Indel Accuracy CCS]

The difference for SNP calling between GATK4 and DeepVariant is similar to what we see with Illumina. However, the gap in indel performance is substantial, highlighting the need to adapt existing methods. (Note the use of different y-axes.)

To put these accuracies into context, we compare to SNP and Indel F1 scores for 30x Illumina genomes from PCR-Positive and PCR-Free preparations. The PCR-Positive is the NovaSeq S1: TruSeq 350 Nano sample available on BaseSpace, evaluated on chr20. The PCR-Free is a 30x downsample of our WGS case study. Figure 3 places the accuracy of PacBio CCS as roughly in-between Illumina PCR-Free and PCR-Positive.

[Figure: SNP CCS]

[Figure: Indel Calling CCS]

These accuracies are based on the Genome in a Bottle confident regions. The superior mappability of the PacBio CCS reads likely means this accuracy can be achieved over more of the genome (including clinically important genes) than with short reads.
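For readers less familiar with the F1 scores quoted above, F1 is the harmonic mean of precision and recall computed from true positive, false positive, and false negative calls. The sketch below uses made-up counts purely for illustration. They are not taken from this analysis, and `f1_score` is a hypothetical helper rather than part of any benchmarking tool.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for a set of variant calls."""
    if true_pos == 0:
        return 0.0  # no correct calls: define F1 as zero and avoid dividing by zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative counts only (not results from the post).
    print(f"F1 = {f1_score(true_pos=9900, false_pos=50, false_neg=100):.4f}")
```

Because F1 penalizes both missed variants and spurious calls, it gives a single number for comparing callers.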

How Much PacBio CCS Coverage is Necessary

To understand how the accuracy of DeepVariant relates to coverage, we progressively downsampled from the 28x starting coverage, randomly using 3% fewer reads with each step.

SNP accuracy is quite robust to downsampling, down to a coverage of around 15x. DeepVariant’s SNP F1 at 13.7x coverage is 0.9957, exceeding GATK4’s F1 at 28x (0.9951).

[Figure: DeepVariant SNP]

Indel accuracy declines with a gradual but noticeable slope as coverage drops from 28x. It falls below 0.9 F1 at about 15x.

[Figure: DeepVariant Indel Calling]
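The downsampling series used in this section can be sketched as follows. One assumption needs flagging: the text does not say whether each step removes 3% of the original reads or 3% of the remaining reads. This sketch assumes the former (a linear schedule), under which step 17 keeps 49% of the reads, or about 13.7x from a 28x start. The function names are hypothetical, and a real pipeline would subsample an alignment file rather than a Python list.

```python
import random
from typing import List, Sequence, Tuple


def coverage_schedule(
    start_coverage: float = 28.0, step: float = 0.03, n_steps: int = 20
) -> List[Tuple[float, float]]:
    """Return (kept_fraction, expected_coverage) for each downsampling step.

    Assumption: every step removes `step` of the ORIGINAL read set.
    """
    schedule = []
    for k in range(n_steps + 1):
        fraction = 1.0 - step * k
        schedule.append((fraction, start_coverage * fraction))
    return schedule


def downsample(reads: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Randomly keep about `fraction` of the reads, preserving their order."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    keep_count = round(len(reads) * fraction)
    kept_indices = sorted(rng.sample(range(len(reads)), keep_count))
    return [reads[i] for i in kept_indices]


if __name__ == "__main__":
    for fraction, coverage in coverage_schedule()[:5]:
        print(f"keep {fraction:.2f} of reads -> about {coverage:.1f}x")
```

Fixing the random seed makes each coverage point reproducible, which matters when comparing callers on the same subsample.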

Improving Calls by Adding Phased Haplotype Information

The ability to determine whether two nearby variants are present on the same DNA molecule (e.g. both on the copy inherited from the mother) or on different molecules is called phasing. Longer read lengths improve the ability to phase variants, as tools like WhatsHap demonstrate for PacBio reads.

PacBio uploaded CCS reads annotated with phase information using inheritance from the trio and 10X data. We incorporated this information by sorting the reads in the pileup based on their haplotype, which reorders all of the tensors (e.g. base, strand, MAPQ).

This might sound like a small change, but it may have a substantial impact on how information flows through DeepVariant’s Convolutional Neural Network (CNN). The lowest layers of a CNN see local information. Sorting reads by haplotype means that even at the lowest level, the network can learn that adjacent reads likely come from the same haplotype.

[Figure: Haplotype DeepVariant CCS]

Haplotype sorting had a small positive impact on SNPs, improving F1 from 0.9986 to 0.9988. However, as Figure 4 shows, the effect for Indel F1 was large, increasing F1 from 0.9495 to 0.9720. Now that we know haplotype information has such a strong positive effect, we can consider how to add this using only the PacBio data.
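The haplotype sorting idea can be illustrated with a small sketch. It is not DeepVariant's pileup code: `PileupRead`, its fields, and the choice to place unphased reads last are all assumptions made for illustration. The point it shows is that the read is the unit being sorted, so every channel (base, strand, MAPQ, and so on) is reordered consistently.

```python
from typing import Dict, List, NamedTuple, Optional


class PileupRead(NamedTuple):
    """Hypothetical pileup row: one read with one encoded row per channel."""
    name: str
    haplotype: Optional[int]        # 1 or 2 when phased, None when unphased
    channels: Dict[str, List[int]]  # e.g. {"base": [...], "strand": [...], "mapq": [...]}


def sort_by_haplotype(reads: List[PileupRead]) -> List[PileupRead]:
    """Group reads by haplotype, with unphased reads last (an assumption).

    sorted() is stable, so reads keep their original order within a haplotype.
    """
    return sorted(reads, key=lambda r: (r.haplotype is None, r.haplotype or 0))


def stack_channel(reads: List[PileupRead], channel: str) -> List[List[int]]:
    """Build one channel of the pileup as rows in the current read order."""
    return [read.channels[channel] for read in reads]


if __name__ == "__main__":
    reads = [
        PileupRead("r1", 2, {"base": [1, 1], "mapq": [60, 60]}),
        PileupRead("r2", None, {"base": [2, 2], "mapq": [20, 20]}),
        PileupRead("r3", 1, {"base": [3, 3], "mapq": [60, 60]}),
    ]
    ordered = sort_by_haplotype(reads)
    print([r.name for r in ordered])
    print(stack_channel(ordered, "mapq"))
```

After sorting, adjacent rows in every channel tend to come from the same haplotype, which is the local structure the lowest CNN layers can pick up.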

Using Nucleus for Error Analysis

Jason Chin of DNAnexus (and formerly of PacBio) performed several interesting analyses on top of the open-source Nucleus library, developed to simplify bringing genomics data into TensorFlow. See this Jupyter Notebook for a hands-on demonstration using data from a public DNAnexus project to fetch the sequence context around false positive and false negative sites.

We encoded the sequence alignment of the flanking regions (totaling 65 bases) of each site as two 65 x 4 matrices. The matched, mismatched, or missing bases in the alignments were encoded in the first 65 x 4 matrix. In this matrix, each position has 4 elements that count the number of A, C, G, and T bases of the reads that match the reference at that location. The inserted sequences relative to the reference were encoded in the second 65 x 4 matrix.
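A simplified sketch of the first matrix follows. It is not the code from the notebook. It assumes each read has already been projected onto the 65-base reference window as a string over A, C, G, and T, with "-" marking deleted or missing positions, and it simply counts read bases per position. The insertion matrix is not shown. `encode_window` and `flatten` are hypothetical names.

```python
from typing import List, Sequence

BASES = "ACGT"
WINDOW = 65  # total flanking length used in the text


def encode_window(aligned_reads: Sequence[str], window: int = WINDOW) -> List[List[int]]:
    """Count A/C/G/T per window position across reads (a window x 4 matrix).

    Assumes each read string has exactly `window` characters; any character
    outside ACGT (for example '-') is treated as missing and not counted.
    """
    matrix = [[0, 0, 0, 0] for _ in range(window)]
    for read in aligned_reads:
        if len(read) != window:
            raise ValueError("each read must span the whole window")
        for position, base in enumerate(read):
            column = BASES.find(base)
            if column >= 0:
                matrix[position][column] += 1
    return matrix


def flatten(matrix: List[List[int]]) -> List[int]:
    """Turn the matrix into one long vector for dimensionality reduction."""
    return [value for row in matrix for value in row]


if __name__ == "__main__":
    reads = ["A" * 65, "A" * 64 + "C", "-" * 65]
    vector = flatten(encode_window(reads))
    print(len(vector), "features")
```

Flattened vectors of this kind are what a technique such as t-SNE or UMAP would then embed.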

We collected the error context matrices and treated them as high-dimensional vectors. We applied common dimensionality reduction techniques to see if we could find common patterns around the erroneous sites. Indeed, intriguing clustering structures appeared when we applied t-SNE or UMAP (see the figure below). While long homopolymer A or T sequences cause the majority of the errors, we observed other less trivial common patterns, e.g., di-nucleotide or tri-nucleotide repeats. We also discovered a set of reads sharing an approximately common prefix that corresponds to Alu repeats.

Figure 6. A number of repeat patterns identified by a UMAP embedding of the alignment vectors around the residual error sites.

Future Work

The models generated for this analysis are currently available by request to those considering CCS for their workflows. We expect to make a PacBio CCS model fully available and supported alongside our Illumina WGS and exome models in the next DeepVariant release (v0.8).

Though we feel the current work is strongly compelling for use, we identified a number of areas for continued improvement. Currently, we have trained with examples from only one CCS genome on a single instrument, compared to 18 Illumina genomes from HiSeq2500, HiSeqX, and NovaSeq. Simply having more training examples should improve accuracy.

The ability to generate phasing information solely from the PacBio reads would provide another large gain. We are investigating whether similar approaches are possible to improve DeepVariant for Illumina data in variant-dense regions. Finally, hybrid Illumina-PacBio models are an intriguing possibility to explore.

We have continued to improve DeepVariant’s speed and accuracy, and we expect to achieve similar improvements on PacBio CCS data as it becomes widely used.

Requesting Early Access

To request early access to the PacBio CCS model generated for this work, you can email awcarroll@google.com. We expect general availability of this model alongside our Illumina WGS and Illumina exome models in our next DeepVariant release (v0.8).

The app is also available by request on DNAnexus. For access please email support@dnanexus.com. This app will be broadly available soon.

A Consensus of Scientific Expertise

The CCS manuscript investigates other applications: structural variant calling, genome assembly, phasing, as well as the small variant calling discussed here. This required bringing together many investigators with different specializations across both the wet lab and informatics.

All of these investigators play important roles in evolving PacBio CCS into wide application. We especially want to thank Billy Rowell from PacBio, who coordinated the small variant calling section; Aaron Wenger, who coordinated the broader manuscript; and Paul Peluso and David Rank, who generated the CCS dataset. We also give special thanks to Jason Chin, who helped to bring our team into this investigation and who has been responsible for advancing many cutting-edge PacBio applications over the years.

PAG 2019: Pioneering Frontiers in Genome Assembly

DNAnexus is headed to San Diego! We’re excited to join over 3,000 leading genomic scientists in plant and animal research for the 27th International Plant & Animal Genome Conference, taking place January 12-16. The conference exhibition and symposium brings the community together to discuss recent advancements, ongoing projects, and future studies in the field. Learn more and register for the conference here.

Come by Booth #227 to learn about our exciting genome assembly projects, including the Vertebrate Genomes Project and the new maize population assembly involving 26 cultivars, and how DNAnexus can partner with you for fast, accurate, and cost-efficient reference-quality assembly. Join us for our activities listed below or schedule a meeting with a member of our team.

DNAnexus Talks

Developing Deep Learning Models and Reusable Machine Learning Workflows for Genomics in the Cloud: From Single Cell Images to Variant Effects
Speaker: Jason Chin
Saturday, January 12, 8:50am – 9:05am
Location: California Room

Jason Chin, PhD, Senior Director of Deep Learning at DNAnexus, discusses how new advances in machine and deep learning are being leveraged to perform large-scale alignments and variant calling, and to process plant and animal genomes more efficiently. Jason will focus specifically on how utilizing Docker, Conda and JupyterLab simplifies interactive and reproducible research workflows.

Pragmatic Solutions for Scaling your Analysis: Machine Learning, Imaging, Containers, Clouds and APIs
Speaker: Chai Fungtammasan
Monday, January 14, 4:00pm – 5:30pm
Location: Towne Room

Join Arkarachai Fungtammasan, Ph.D. (Chai), DNAnexus Scientist, at this workshop centered around popular toolkits for machine learning. The session includes a guided tutorial on PyTorch, free-form discussion, and Q&A.

Exploring a Landscape of Genetic Variation in Virtual Reality
Speaker: Sam Westreich
Wednesday, January 16, 12:00pm – 12:15pm
Location: California Room

Sam Westreich, PhD, DNAnexus Microbiome Scientist, will showcase BigTop, a fully immersive data exploration experience that uses virtual reality to examine genomic data. Sam will demonstrate how scientists can use the open-source tool to view multiple supported datasets.

Customer Talks

Assembly and Comparative Genomic Analysis of the Maize NAM Founders
Speaker: Matthew B. Hufford from Iowa State University
Sunday, January 13, 8:40am – 9:00am
Location: Golden West Room

Whole Genome Assembly and Annotation of the Maize NAM Founders
Tuesday, January 15 at 2:30pm – 2:42pm
Location: California Room

Kelly Dawe, PhD, & Jianing Liu from the Dawe Lab at the University of Georgia will discuss their study sequencing and assembling the complex and genetically diverse maize genome. The team uses PacBio, BioNano and Illumina sequencing to assemble the B73 maize inbred genome along with 25 other maize inbreds. The talk is part of the Gramene workshop examining community approaches for improving structural annotations of genes and transposable elements.

Vertebrate Genome Project (VGP) / Genome 10K (G10K) Planning Meeting
Wednesday, January 16 at 9:30am – 3:30pm
Location: Sunrise Room

Join the planning session dedicated to coordinating efforts for upcoming VGP & G10K projects. The workshop is open to all PAG attendees, so come learn how to join the effort to generate reference-quality genome assemblies of all 66,000 extant vertebrate species.

2018 Highlights: Expanding Beyond Secondary Analysis and Scaling up Collaboration

We can’t believe it’s the end of the year already. 2018 has been an extremely busy and productive year, not only here at DNAnexus but for the genomics industry at large. As customers’ research projects grew in size and scope, we launched new products and solutions designed to solve their toughest challenges around scale and collaboration.

Here are a few of our favorite highlights from 2018:

Expanding to Support Clinical Trials & Translational Informatics

We launched our Clinical Trials Solution (CTS) earlier in the year to streamline the use of next-generation sequencing (NGS) data for customers looking to integrate genomics and other omics data into regulated clinical trials. Almac Diagnostics leverages the DNAnexus CTS for genome-based biomarker delivery in clinical trials, and to help other pharmaceutical companies integrate NGS data into their clinical trials.

More recently, we unveiled DNAnexus Apollo™, our advanced platform for multi-omics and clinical data science exploration, analysis, and discovery. Pharmaceutical research and development teams can leverage DNAnexus Apollo in their translational informatics research to rapidly test multiple hypotheses and gain valuable insight into mechanisms of action, biomarkers, and targets.

Collaborative Genomics Platforms

We partnered with St. Jude and Microsoft to create St. Jude Cloud which, along with being an online data-sharing and collaboration platform, is also the world’s largest public repository for pediatric cancer genomics data. St. Jude Cloud offers unique analysis tools and visualizations in a secure cloud-based environment. St. Jude was recognized last week as a 2019 Digital Edge 50 and Ones to Watch award winner for their significant impact in fostering global research collaboration through the St. Jude Cloud platform. Learn more about this groundbreaking effort through interviews with the team and learn how to access and utilize the online tools.

The Melbourne Genomics Health Alliance (MGHA) also launched the clinical genomics platform, GenoVic, that allows all 10 members (healthcare and biomedical research organizations) to access and collaborate on projects. This achievement allows the alliance to utilize shared data sets and tools available on GenoVic, leveraging DNAnexus for their cloud genome informatics.

Partnerships Fueling Innovation

At the start of the year, we announced our support for the Vertebrate Genomes Project (VGP), which aims to assemble and analyze genomes of all vertebrate species. Their work resulted in the release of the first 15 high-quality reference genomes on the GenomeArk. The 15 genomes created through the VGP demonstrate the strength of the consortium as they continue on to the next series of vertebrate species.

Baylor College of Medicine’s Human Genome Sequencing Center (HGSC) recently announced their participation in the NIH’s All of Us Research Program. The large-scale effort aims to sequence more than 1 million genomes across the United States to build the most diverse biomedical data resource of its kind. The HGSC will be one of three centers dedicated to generating the sequence and providing clinical reports along with Johns Hopkins Genomics Center for Inherited Disease Research (CIDR) and University of Texas Health Science Center at Houston’s School of Public Health. DNAnexus is proud to serve as the HGSC’s genomic cloud computing partner in this endeavor. Learn more about HGSC’s sequencing infrastructure in this webinar with Will Salerno, Director of Genome Informatics.

DNAnexus Platform Enhancements

We are excited to announce our achievement of FedRAMP Moderate ATO status. The Federal Risk and Authorization Management Program (FedRAMP) authorization, sponsored by the Department of Health & Human Services, enables Federal agencies to rapidly and securely integrate cloud-based biomedical informatics into their research and services. This status reconfirms our commitment to the security and compliance of our customers’ data, both for federal and nonfederal users. Learn more about what this new authorization means for you by watching our webinar.

The DNAnexus Platform also expanded to include a number of new apps, features, and integrations. We started by offering Google’s DeepVariant, designed to call genetic variants from next-generation sequencing data using deep neural networks. Customers can now use CWL and WDL in their DNAnexus workflows. Licensed customers can also now use the Smart Reuse and Audit Trail features, saving them time in reusing previous job results and tracking organizational activity. Log in to your account to see the new changes.

Favorite Scientific Blogs

The DNAnexus Blog continues to host a number of articles from our scientific experts. Articles span a variety of subjects from around the genomics industry, with authors sharing industry insights, opinions and references.

Here are some of our favorite blog posts from this year:

We want to thank our customers and partners for their role in all of the accomplishments from this past year. We can’t wait to share all of our exciting upcoming news in 2019, but until then, have safe and happy holidays.