Highly Accurate SNP and Indel Calling on PacBio CCS with DeepVariant

Alexey Kolesnikov Pi-Chuan Chang AuthorsJason Chin Andrew Carroll Authors

In this blog we discuss the newly published use of PacBio Circular Consensus Sequencing (CCS) at human genome scale. We demonstrate that DeepVariant trained for this data type achieves similar accuracy to available Illumina genomes, and is the only method to achieve competitive accuracy in Indel calling. Early access to this model is available now by request, and we expect general availability in our next DeepVariant release (v0.8).

Editorial Note: This blog is published with identical content here and on the Google DeepVariant blog. Re-training of DeepVariant and accuracy analyses were performed by Alexey Kolesnikov, Pi-Chuan Chang, and Andrew Carroll from Google. Sequence context error analysis was performed by Jason Chin of DNAnexus.

PacBio Circular Consensus and Illumina Sequencing by Synthesis

The power of a sequencing technology (e.g. accuracy, throughput, and read length) is determined by its underlying biochemistry and physical measurement. Each read in Illumina sequencing by synthesis (SBS) corresponds to clustered copies of the same DNA molecule. The consensus of an SBS cluster provides high accuracy, but molecules in the cluster go out of phase longer in the read, ultimately limiting read lengths.

The ability of PacBio’s single molecule real-time (SMRT) sequencing to measure a single DNA molecule allows it to escape this limitation on read length (important in applications like genome assembly, structural variation, and in difficult regions). However, sampling a single molecule without a consensus is more error prone, with base error rates of 10-15%.

PacBio CCSPacBio CCS builds a consensus on the same base. A stretch of DNA with a controlled length (e.g. 15,000 bases) is linked by known adapters. Sequencing  DNA multiple times provides the best of both worlds: a long read length and a measurement, reaching 99% per-base in a consensus read. This promises a single data type strong for both small variant analysis and structural variation.

Although the PacBio CCS base error rates are low, the sequence context of the errors differs from Illumina’s, and variant callers need modification to perform optimally. Instead of being coded by humans, DeepVariant learns which features are important from the data. This unique attribute allows it to be quickly adapted to PacBio CCS by re-training on this data.

Re-Training DeepVariant for PacBio CCS

Training DeepVariant involves starting from a model checkpoint and showing it labeled examples. This changes the weights of the model over time. We started from a DeepVariant Illumina WGS model and re-trained this with PacBio CCS data, excluding chromosome 20 to allow this to be used for independent evaluation. The PacBio CCS reads were generated from HG002, which has a truth set available from Genome in a Bottle and is the basis for training. (To learn more, see this walkthrough on training).

SNP Accuracy CCSFigures 2A and 2B show the SNP and Indel performance on PacBio CCS for DeepVariant after retraining for CCS and GATK4 run with flags and filters chosen by PacBio to improve performance for CCS.

Indel Accuracy CCSThe difference for SNP calling between GATK4 and DeepVariant is similar to what we see with Illumina.  However, the gap in indel performance is substantial, highlighting the need to adapt existing methods. (Note the use of different y-axes) .

To put these accuracies into context, we compare to SNP and Indel F1 scores for 30x Illumina genomes from PCR-Positive and PCR-Free preparations. The PCR-Positive is the NovaSeq S1: TruSeq 350 Nano sample available on BaseSpace, evaluated on chr20. The PCR-Free is a 30x downsample of our WGS case study. Figure 3 places the accuracy of PacBio CCS as roughly in-between Illumina PCR-Free and PCR-Positive.

SNP CCS

 

Indel Calling CCSThese accuracies are based on the Genome in a Bottle confident regions. The superior mappability of the PacBio CCS reads likely means this accuracy can be achieved over more of the genome (including clinically important genes) than with short reads.

How Much PacBio CCS Coverage is Necessary

To understand how the accuracy of DeepVariant relates to coverage, we progressively downsampled from the 28x starting coverage, randomly using 3% fewer reads with each step.

SNP accuracy is quite robust to downsampling, down to a coverage of around 15x. DeepVariant’s SNP F1 at 13.7x coverage is 0.9957, exceeding GATK4’s F1 at 28x (0.9951).

DeepVariant SNP

Indel accuracy declines with a gradual, but noticeable slope as coverage drops from 28x. It crosses the threshold of 0.9 F1 at about 15x.

DeepVariant Indel Calling

Improving Calls by Adding Phased Haplotype Information

The ability to determine whether two nearby variants are present on the same DNA molecule (e.g. both on the copy inherited from the mother) or on different molecules is called phasing. Longer read lengths improve the ability to phase variants, as tools like WhatsHap demonstrate for PacBio reads.

PacBio uploaded CCS reads annotated with phase information using inheritance from the trio and 10X data. We incorporated this information by sorting the reads in the pileup based on their haplotype, which reorders all of the tensors (e.g. base, strand, MAPQ).

This might sound like a small change, but it may have a substantial impact on how information flows through DeepVariant’s Convolutional Neural Network (CNN). The lowest layers of a CNN see local information. Sorting reads by haplotype means that even at the lowest level, the network can learn that adjacent reads likely come from the same haplotype.

Haplotype DeepVariant CCSHaplotype sorting had a small positive impact on SNPs, improving F1 from 0.9986 to 0.9988. However, as Figure 4 shows, the effect for Indel F1 was large, increasing F1 from 0.9495 to 0.9720. Now that we know haplotype information has such a strong positive effect, we can consider how to add this using only the PacBio data.

Using Nucleus for Error Analysis

Jason Chin of DNAnexus (and formerly of PacBio) performed  several interesting analyses on top of the open-source Nucleus developed to simplify bringing genomics data into TensorFlow.  See this Jupyter Notebook for a hands-on demonstration using data from a public DNAnexus project to fetch the sequence context around false positive and false negative sites.

We encoded the sequence alignment of the flanking regions (totaling 65 bases) of each site as two 65 x 4 matrices. The matched, mismatched or missing bases in the alignments were encoded to the first 65 x 4 matrix. In this matrix, each column had 4 elements and counts the number of A/C/G and T bases of the reads that match the reference at a given location.  The inserted sequences relative to the reference were encoded into the second 65 x 4 matrix.

We collected the error context matrices and treated them like high dimension vectors. We applied common dimensionality reduction techniques to see if we could find common patterns around the erroneous sites. Indeed, intriguing clustering structures appeared when we applied T-SNE or UMAP (see the figure below).  While long homopolymer A or T sequence cause the majority of the errors, we observed other less trivial common patterns, e.g., di-nucleotide or tri-nucleotide repeats. We also discovered a set of reads that have approximately a common prefix that is corresponding to Alu repeats.

Figure 6. A number of repeats patterns identified by a UMAP embedding of the alignment vectors around the residue error sites.UMAP DeepVariant CCS

Future Work

The models generated for this analysis are currently available by request to those considering CCS for their workflows. We expect to make a PacBio CCS model fully available and supported alongside our Illumina WGS and exome models in the next DeepVariant release (v0.8).

Though we feel the current work is strongly compelling for use, we identified a number of areas for continued improvement. Currently, we have trained with examples from only one CCS genome on a single instrument, compared to 18 Illumina genomes from HiSeq2500, HiSeqX, and NovaSeq. Simply having more training examples should improve accuracy.

The ability to generate phasing information solely from the PacBio reads would provide another large gain. We are investigating whether similar approaches are possible to improve DeepVariant for Illumina data in variant-dense regions. Finally, hybrid Illumina-PacBio models are an intriguing possibility to explore.

We have continued to improve DeepVariant’s speed and accuracy, and we expect to achieve similar improvements on PacBio CCS data as it becomes widely used.

Requesting Early Access

To request early access to the PacBio CCS model generated for this work, you can email awcarroll@google.com. We expect general availability of this model alongside our Illumina WGS and Illumina exome models in our next DeepVariant release (v0.8).

The app is also available by request on DNAnexus. For access please email support@dnanexus.com. This app will be broadly available soon.

A Consensus of Scientific Expertise

The CCS manuscript investigates other applications: structural variant calling, genome assembly, phasing, as well as the small variant calling discussed here. This required bringing together many investigators with different specializations across both the wet lab and informatics.

All of these investigators play important roles in evolving PacBio CCS into wide application. We want to specially thank Billy Rowell from PacBio, who coordinated the small variant calling section, and Aaron Wenger, who coordinated the broader manuscript, and Paul Peluso and David Rank, who generated the CCS dataset. We also give special thanks to Jason Chin, who helped to bring our team into this investigation, and who has been responsible for advancing many cutting edge PacBio applications over the years.

New ENCODE Paper Reveals Remarkable Chromatin Diversity at Regulatory Elements

 

Today marks a major milestone for the ENCODE consortium! More than 30 papers will be published today in Nature, Genome Biology, and Genome Research from teams of scientists working on various facets of the project.

 

One of those, a publication in Genome Research, reports on a surprising level of heterogeneity among patterns of chromatin modifications as well as nucleosome positioning around regulatory elements such as transcription factor binding sites in the human genome. In the past, these genomic elements have been studied primarily by averaging patterns of chromatin marks across populations of sites, leading to the perception that patterns were much more uniform. The nucleosome positioning sequence data mapping and analysis was performed on the DNAnexus platform.

 

Lead author Anshul Kundaje was a postdoc for Serafim Batzoglou and Arend Sidow at Stanford University during the project reported in the paper. Now a research scientist at MIT, Kundaje says the work was an integral part of the ENCODE consortium’s efforts to elucidate functional elements in the human genome. The scientists looked at 119 human transcription factors and regulatory proteins to better understand how nucleosomes are positioned and how histone modifications are made around binding sites. In the paper, the authors report that asymmetry of nucleosome positioning and histone modifications is the rule, rather than the exception.

 

Kundaje and his colleagues relied on ChIP-seq data for the 119 transcription factors in a variety of cell types, with corresponding data for histone modifications. They also generated similar data for nucleosome positioning. To improve accuracy, the team sequenced extremely deeply, ultimately generating some 5 billion reads on the SOLiD sequencing platform. “The data sets were incredibly massive,” Kundaje says. “Processing these data sets locally was quite a challenge.” The group turned to DNAnexus, uploading their sequence files to the cloud and preprocessing the data with the company’s probabilistic mapping tool. “DNAnexus made that process incredibly simple,” he adds.

 

Figure 1:The mapping of the 5 billion reads was performed using the DNAnexus mapper.

Using a new tool they developed — the Clustered AGgregation Tool (CAGT) for pattern discovery — the scientists found that nucleosome positioning and histone modification at transcription factor binding sites is far more diverse than was previously thought. Rather than averaging across the regions as most studies have done, the new clustering tool was able to analyze the differences in magnitude, shapes, and orientation of the many patterns identified.

 

“What we found is that the results you get from the clustering approach are dramatically different from what you get by simply averaging across all types,” Kundaje says. “We found a large diversity of patterns of histone modifications as well as nucleosome positioning around almost every transcription factor binding site.”

 

Even the well-known and remarkably well studied transcription factor CTCF, long established as an insulator, was found to have surrounding chromatin patterns pointing to other functions throughout the genome.

 

Figure 2: Analysis using CAGT reveals the surprising diversity of patterns of an active chromatin mark H3K27ac around the binding sites of the CTCF protein that is well-known for its repressive insulator role.

 

The authors used their clustering tool to group the patterns into some 25 distinct signatures “that completely capture the diversity of all the modifications across all binding sites in a variety of cell types,” Kundaje says. The method uses ‘metapatterns’ to explain that diversity, and that information can reveal the function of these elements in context. “By accounting for combinatorial relationships between various binding events and how they affect chromatin, this gives you a more complete biological sense of what a transcription factor is doing in a cell type,” he adds.

 

Kundaje is already following up on this study by looking at other species to see whether the heterogeneity of modification patterns holds true in other organisms. He continues to use DNAnexus for analysis of sequencing data, especially in read mapping, quality control, and genome browsing, he says.

 

Using DNAnexus for the team’s ENCODE study “made the process significantly easier,” Kundaje adds, noting that the cloud provider’s direct integration of the genome browser was particularly helpful. DNAnexus allowed Kundaje and his colleagues to go from data to visualization with minimal processing steps in between, he says. “It frees up your time to focus on the more interesting work.”

 

For a glimpse of some of Kundaje’s data, DNAnexus has made the 20 samples available on DNAnexus in the Public Data folder, called Encode. Click here to sign up for a free account.

 

Check out the Kundaje et al. paper “Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements.”

 

Scientific Collaborators in New York and Jerusalem Uncover New Mutation Underlying Rare Sensory Disease

The study described below was published in the April 2012 edition of the Annals of Neurology, the journal for the American Neurological Association and the Child Neurological Society.

At a time when many people are asking when DNA sequence information will have a real application in healthcare, a nonprofit organization based in Brooklyn, New York, is proving that linkage mapping and exome sequencing are already making a major difference in people’s lives.

Bonei Olam is charged with helping families with genetic or undiagnosed diseases, many of them dealing with infertility challenges, to conceive healthy babies using tools such as pre-implantation genetic diagnosis. In 2008, Bonei Olam opened its Center for Rare Jewish Genetic Disorders to find the underlying molecular causes of some of these conditions. They responded to hundreds of families who had been unable to find a diagnosis through traditional medical routes, and established key collaborations with universities, including Hadassah Medical Center in Israel, to help perform the studies. Over the years, the center has funded SNP arrays, Sanger sequencing, variant validation, and the sequencing of more than 100 exomes for family after family.

It was through one of those studies that collaborators from Hadassah Medical Center, New York University, and Bonei Olam discovered a novel mutation that leads to a previously uncharacterized disease linked to hereditary sensory autonomic neuropathy, a group of disorders with the common theme of loss of function in peripheral sensory nerves. This new version is far more severe than its familial dysautonomia cousin and is caused by a mutation in the DST gene, which destabilizes the Dystonin protein. The study, called “Hereditary sensory autonomic neuropathy caused by a mutation in dystonin” was published by Dr. Simon Edvardson, Prof. Orly Elpeleg, and the rest of their team in the Department of Genetic and Metabolic Diseases at Hadassah in the April 2012 issue of Annals of Neurology.

Chaim Jalas, Director of Genetic Resources and Services at the Center for Rare Jewish Genetic Disorders and a co-author on the paper, says that this particular project began when two related families approached Bonei Olam, each having lost at least one child to this uncharacterized disease. The disorder was lethal: all of the affected children, three in total, died by the age of 2.

The team started off with SNP arrays to perform linkage analysis in both families, and later performed exome sequencing on one of the affected children to find the causative mutation. Much of the clinical and functional work in the eight-month project — including identifying the mutated gene and studying its effect in cell lines — was led by Prof. Orly Elpeleg at Hadassah Medical Center in Jerusalem.

Since Bonei Olam doesn’t have an in-house bioinformatics team, Jalas relies for interpretation on various software tools as well as the cloud-based storage and analysis platform from DNAnexus. For this project, he uploaded the raw sequence reads to DNAnexus and ran the Exome analysis tool followed by the Variant analysis tool, which located the mutation — the DST variant that results in an unstable transcript in Dystonin, a protein used in the cytoskeleton. “What DNAnexus does for us is all the bioinformatics, starting from uploading raw reads to performing the alignment, the variant calling, the annotation, and graphical display of the reads on the reference genome,” Jalas says. In the past, the DNAnexus Variant analysis tool has been able to find a variant that other software packages have missed, he says — but he’s most confident when two different software packages call the same variant so it’s more likely to be real.

Once DNAnexus returned the answer, Jalas shared the data with his collaborators, who could log in with their own accounts to review the information. Finally, the research team confirmed the mutation by Sanger sequencing.

Since Bonei Olam isn’t your typical research institute, the real triumph was not the research finding or the publication of this mutation; it’s that “one of the two families is currently pregnant with a healthy baby,” Jalas says.

Ultimately, the success of these studies may prompt Bonei Olam to move toward whole genome sequencing. “I think at some point we will do whole genomes,” Jalas says. “We’re looking into a pilot study of families for whom exome sequencing did not find a causative genetic mutation where we know for sure it’s a genetic condition.”

Paper information:
Hereditary sensory autonomic neuropathy caused by a mutation in dystonin
Simon Edvardson, MD; Yuval Cinnamon, PhD; Avraham Shaag, PhD; Orly Elpeleg, MD, from Monique and Jacques Roboh Department of Genetic Research, Hadassah, Hebrew University Medical Center
Chaim Jalas, from Bonei Olam, Center for Rare Jewish Genetic Disorders
Channa Maayan, MD, from Department of Pediatrics, Hadassah, Hebrew University Medical Center
Felicia B. Axelrod, MD, from Department of Pediatrics, New York University School of Medicine
DOI: 10.1002/ana.23524
http://onlinelibrary.wiley.com/doi/10.1002/ana.23524/abstract