One Genome Browser to Rule Them All?

Maria Nattestad

At VIZBI 2018 (Visualizing Biological Data) [https://vizbi.org/2018/], I proposed a topic for a breakout session called “One Genome Browser to Rule Them All”. This led to a discussion that I found fascinating and inspiring, and I am writing it up here so we all have a place to continue the conversation and explore where to go next.

The basic idea is that there should be a genome browser that the whole community of researchers can use and build on, creating new track types as plugins that can be shared with other scientists, without the complexity of modifying the core framework.
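
To make that idea concrete, here is a minimal sketch of what such a plugin contract could look like. Every name below is hypothetical; no existing browser exposes exactly this API.

```typescript
// Hypothetical plugin contract for a shared genome browser core.
// None of these names come from an existing project; the point is that a
// new track type implements fetch() and render() without touching the core.

interface Region {
  chrom: string;
  start: number; // 0-based, inclusive
  end: number;   // 0-based, exclusive
}

interface TrackPlugin<TFeature> {
  name: string;
  // Load whatever data this track needs for the currently visible region(s).
  fetch(regions: Region[]): Promise<TFeature[]>;
  // Draw those features into the canvas area the core browser allocates.
  render(ctx: CanvasRenderingContext2D, features: TFeature[], regions: Region[]): void;
}

// A toy community-contributed track: draw a tick at each feature position.
export const tickTrack: TrackPlugin<{ pos: number }> = {
  name: "example-ticks",
  async fetch(regions) {
    // A real plugin would query a tabix-indexed file or a web service here.
    return regions.map((r) => ({ pos: r.start }));
  },
  render(ctx, features, regions) {
    const { start, end } = regions[0]; // a multi-locus-aware core could pass several regions
    const scale = ctx.canvas.width / (end - start);
    for (const f of features) {
      ctx.fillRect((f.pos - start) * scale, 0, 1, 20);
    }
  },
};
```

In this picture, the core browser owns navigation, coordinate systems, and shared file-format parsing, and simply iterates over registered track plugins when drawing.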

We are at a point where web technologies have matured and made it possible to have more modularity and reusability of code on the web, just like what was possible in Java that enabled Cytoscape to have a very useful library of community-created plugins. This kind of large-scale collaboration to build an ideal genome browser is, I believe, at least an interesting thought experiment as the bioinformatics field matures and we are joined by many of the world’s biologists who rely increasingly on genome and transcriptome sequencing for their innovative research.

So here is roughly what we discussed during this session, with some blue sky wishful thinking:

Built for the web

We agreed that a JavaScript genome browser would be preferable to a desktop application given how much of genomics has moved onto the web with large databases of resources, cloud platforms, and lots of academic web applications that make genomics more accessible. If the genome browser is built for the web, each of these groups of services could have the genome browser embedded into it for better interactive exploration of the data. Among the genome browsers built for the web, some only need to be run from the front-end (IGV.js, pileup.js, Biodalliance) while others also need some back-end, server-side code or configuration to be used (JBrowse, NGB). It is significantly easier to embed the front-end browsers into a web application, whether that is a small academic project or a large platform like Galaxy or DNAnexus. If a back-end is the only way to improve performance, perhaps it can be an optional feature. I’m curious to hear people’s experiences on when having a back-end improves performance, and whether this is something that could be separated intelligently.
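
As an example of how lightweight front-end embedding can be, here is a sketch in the style of igv.js. The genome, locus, and track URLs are placeholders, and the exact configuration options may differ between igv.js versions.

```typescript
// Minimal front-end-only embedding sketch in the style of igv.js.
// No custom back-end is required; data files are fetched directly from
// static hosting that supports HTTP range requests. URLs are placeholders.
// @ts-ignore -- igv may not ship its own TypeScript declarations
import igv from "igv";

const container = document.getElementById("browser-div");
if (container) {
  igv
    .createBrowser(container, {
      genome: "hg38",
      locus: "chr8:127,736,000-127,742,000",
      tracks: [
        {
          name: "Alignments",
          type: "alignment",
          format: "bam",
          url: "https://example.com/sample.bam",          // placeholder
          indexURL: "https://example.com/sample.bam.bai", // placeholder
        },
      ],
    })
    .then(() => {
      console.log("Genome browser is ready");
    });
}
```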

Future-proof

Importantly, this genome browser should also make fewer assumptions about how data will be visualized, to future-proof it against new technologies and ideas that need new types of visualization. Examples of limitations that are hard to overcome in current genome browsers include the difficulty of plotting features that need to connect to multiple loci, which has become increasingly common in recent years for showing long reads (PacBio, Nanopore), linked reads (10X), and long-range variants including gene fusions, to name a few. Current genome browsers can sometimes show multiple loci next to each other, but the tracks cannot cut across and connect to more than one of these loci. See these examples of connecting across multiple loci from Ribbon and gGnome:

[Figure: connecting features across multiple loci in the gGnome and Ribbon browsers]
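
One way to think about the underlying assumption is at the level of the feature model itself. The sketch below is purely illustrative (these interfaces are hypothetical, not taken from any existing browser): most track formats effectively assume a feature is a single interval, whereas long reads, linked reads, and fusions need a feature that owns several loci and the connections between them.

```typescript
// Hypothetical feature models: the usual single-interval assumption
// versus a representation that can connect multiple loci.

interface Locus {
  chrom: string;
  start: number;
  end: number;
}

// What most track formats effectively assume:
// one feature corresponds to one interval on one chromosome.
export interface IntervalFeature extends Locus {
  name: string;
}

// What split long-read alignments, linked reads, and gene fusions need:
// one feature that owns an ordered list of loci plus the links between them.
export interface MultiLocusFeature {
  name: string;
  segments: Locus[];              // pieces of the feature on the reference
  links: Array<[number, number]>; // pairs of indices into `segments` to connect
}

// Illustrative example only; coordinates are made up.
export const fusion: MultiLocusFeature = {
  name: "geneA-geneB fusion",
  segments: [
    { chrom: "chr1", start: 155_000_000, end: 155_010_000 },
    { chrom: "chr22", start: 23_180_000, end: 23_190_000 },
  ],
  links: [[0, 1]],
};
```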

Ann Loraine, who is part of the IGB team, explained that the Integrated Genome Browser (IGB) is an older genome browser that predates IGV. The IGB team had attempted to add multi-locus functionality more recently, but they found that it was no longer possible to make such a fundamental change to the architecture of IGB. All software developers have to make assumptions when starting a project, but it’s important to examine those assumptions and consider all the features we might eventually want to include. That is why future-proofing is such an important topic for this discussion.

To highlight another dimension of future-proofing, Valerie Schneider of the NCBI pointed out that a flood of new genomes, especially from the Vertebrate Genome Project, is one of the challenges she sees on the horizon for the NCBI’s own genome browser. Valerie also mentioned that enabling the genome browser to connect to data from government and consortia resources would be very important.

Christian Stolte of the New York Genome Center showed a mini-browser within MetroNome that visually connects protein domains to their genomic coordinates along a gene. That example brought up an interesting point: non-genomic coordinate systems such as transcripts and protein domains could also become first-class citizens. So then I guess we are no longer just talking about a genome browser but actually a multiome browser!

Notice that all of these use cases we discussed for future-proofing are already needed today and supported by niche mini-browsers. If you know of other current and predicted future use cases that are not already covered by existing genome browsers or by the ideas presented here, please add them in the comments, so we can all broaden our horizons and expand our vision for the ideal genome browser.

A library of community-contributed plugins

Before I built each of my own visualization tools, I tried to add the functionality I wanted to the genome browser I was using at the time, the desktop version of IGV. First, I needed to show two loci at the same time, with connecting lines between them for long-range variants. IGV could show two regions at once, but they were independent and I couldn’t draw lines across them, so I built a visualization tool called SplitThreader to show long-range variants as connecting lines between two loci. Later, I needed to show alignments of long reads in a way that makes it clear where the alignments fall along the read’s own length in addition to where they are on the reference, so I built Ribbon. The funny thing is, after building the parts of the visualization that are really novel, you still have to add several parts that normal genome browsers can do just fine, like drawing genes, variants, and other features from a BED file. I would end up spending over 50% of my time implementing features that already exist elsewhere, and pretty shallow versions of them too.

Since starting my data visualization role at DNAnexus, I also have a whole new appreciation for modularity, because the platform is built to run virtually any kind of genomics analysis on any kind of applicable data. I dream of having a genome browser that could be equally flexible and modular, that anyone could contribute new types of visualization to, and that could be integrated and used anywhere from consortium database explorers to digital paper figures. If anyone can build an app on DNAnexus or Galaxy and publish it for others to use, then why is it not the same for adding a special track to a genome browser?

Conclusions

I would like to mention that the genome browsers out there right now are fantastic. When thinking about this ideal genome browser, there are several ideas that come directly from existing tools: IGV.js is very lightweight and does an excellent job of being embeddable anywhere, while JBrowse has a library of community-contributed plugins that I find very encouraging for collaboration in this field.

Genome browsers have significant similarities with each other, but each one also has its unique strengths that not all of them share. When I need to build a novel visualization for a particular data type, concept, or application, I would love to be able to just build a track into an already powerful and full-featured genome browser, instead of building the novel feature into its own tool and then re-implementing gene, BED, and variant track types. The users of my tools would surely rather have the full power of an IGV or a JBrowse at their fingertips even while looking at their long-read alignments in a Ribbon browser.

Now I would love to hear what you think!

Are there any genome browser teams out there who have been thinking about modularity, plugins, future-proofing, and making a lightweight yet powerful genome browser for the web?

Amplifying Google’s DeepVariant

In this blog post we quantify Google Brain’s recent improvements to DeepVariant, detailing significant gains on both exomes and PCR-prepared genomes. We reflect on how this improvement was achieved and what it suggests for deep learning within bioinformatics.

Introduction

The Google Brain Team released DeepVariant as an open-source GitHub repository in December 2017. Our initial evaluation of DeepVariant and our Readshift evaluation method (blog, code) identified that while DeepVariant had the highest accuracy of all methods on the majority of samples, there were a few outliers with much higher Indel error rates.

Subsequently, we realized that these difficult samples had a common feature – they were all prepared with PCR. Also, the initial release of DeepVariant had not been trained on exome data. Based on this, DNAnexus provided exome and Garvan Institute PCR WGS samples to Google Brain to train improved DeepVariant models, which are now released. We also recognize Brad Chapman, who independently collaborated with Google Brain.

Improving Deep Learning Methods Compared to Improving Traditional Methods

Many factors make variant calling more complex in PCR+ samples and exomes. PCR amplification is biased by GC-content and other factors. Errors that accumulate in the PCR process are difficult to distinguish from true variants. Exomes have an additional complexity of uneven capture efficiency and coverage.

Knowledge of these facts is incorporated differently in human-programmed methods, traditional machine learning, and deep learning methods. Human-written methods require a programmer to carefully develop heuristics which capture the relevant properties without too much rigidity. Traditional machine learning approaches require a scientist to identify informative features that capture the information, encode those from the raw data, and train and apply models.

Deep learning based methods like DeepVariant attempt to represent the underlying data in as raw a manner as possible, relying on the deep learning models to learn the features. Here, the scientist’s role becomes identifying examples that embody the diversity of conditions the tool must solve, and presenting as many of these varied examples as possible to the training machinery. How Google Brain made these improvements may be as interesting as the improvements themselves.

Exome-Trained Models Significantly Improve Performance

Google Brain’s initial release was DeepVariant v0.4. The v0.5 version added the ability to apply either a WGS model or an exome model. The v0.6 version specifically added PCR+ training for the first time.

All evaluations use the hs37d5 reference and v3.3.2 Truth Sets from NIST Genome in a Bottle. Evaluation is performed with hap.py in the same manner as used in precisionFDA. Google’s release notes indicate they never train on HG002 or on chr20 of any sample.

On exomes, DeepVariant improved from approximate parity to a 2-fold error reduction relative to GATK. There are two ways to put this improvement in accuracy into context. DeepVariant has an absolute decrease of 231 errors and 121 false positives. For a callset of 500,000 exomes – as the UK Biobank will soon make available – the fact that false calls tend to distribute randomly while true variants are shared in the population could mean a reduction of 60 million false variant positions in the full callset (for reference, the 60,000-exome ExAC set contained 7 million variant positions).
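
As a rough back-of-the-envelope version of that estimate (assuming the false positives accumulate roughly independently across samples while true variants are shared):

121 false positives per exome × 500,000 exomes ≈ 60 million false positive positions across the full callset.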

Another way to frame the value: does the improved accuracy allow similar performance to traditional methods at lower coverage? Figure 2 demonstrates that DeepVariant run on the same exome downsampled randomly to 50% coverage is more accurate than other methods at full coverage. This could allow a re-evaluation of required sequencing depth to either increase throughput, save cost, or refocus sequencing depth on complex but important regions.

Including PCR+ WGS Data Greatly Improves DeepVariant Performance

DeepVariant was not specifically trained on PCR-prepared WGS samples until the v0.6 release, when 5 PCR+ genomes were added to the training set. Even before this addition, DeepVariant was the most accurate SNP caller surveyed in our PCR+ data. However, as Figure 3A indicates, the inclusion of PCR training data improved SNP accuracy noticeably.

Indels in these samples were a problem for almost all of the callers. Indel error rates were so much higher in these samples that they were the dominant error mode for every method. The exception was Strelka2, which performed vastly better. Illumina clearly put great care into modeling PCR errors when developing Strelka2. Edico was also able to use this observation to improve their DRAGEN method, as detailed in our recent blog.

This observation was key to realizing that the performance penalty was due to PCR. When facing a problem with additional noise and complexity, it can be unclear whether the lack of signal makes the problem fundamentally less solvable or if the problem remains similarly tractable, but instead requires more effort. Strelka2’s performance was proof that current methods could improve significantly.

As Figure 3B shows, the inclusion of PCR+ data has a huge impact on the error rate of DeepVariant, with a 3-fold error reduction. Simply identifying the right training examples was sufficient to go from worse than GATK to 10% better than the prior leader, Strelka2.

How Do New Training Data Impact Other Applications?

By including PCR+ and exome data in training, DeepVariant is now being asked to encode more information within its networks. Asking methods to multi-task in this way can lead to interesting effects and can force trade-offs which decrease overall performance.

An interesting phenomenon observed with deep learning methods is that challenging them with diverse, hard problems can sometimes improve general performance. Ryan Poplin, one of the authors of DeepVariant, discusses this phenomenon in a recent ML/AI Podcast on his work on an image classification method for diabetic retinopathy.

This could be understood in a few ways. First, harder problems may create pressure for network weights to identify subtle but meaningful connections in the data. For example, a strong deep learning approach trained only on novice chess opponents would only be so good. Another is that training examples that depend differentially on various aspects of a problem help resist overfitting, allowing the model to discover and preserve subtle signals it otherwise could not. The result is a better generalist model.

Many variables changed in the progression of DeepVariant from v0.4 to v0.6 – more overall training data was added and the tensor representation changed – but if DeepVariant were benefiting from the added complexity, one might hypothesize that this would manifest as a more pronounced change in Indel performance, as that is the area most impacted in both the exome and PCR data. To test this, we ran each version on a “standard” benchmark – PCR-Free WGS on HG002.

The new releases of DeepVariant show a trajectory of improvement, which is impressive given that DeepVariant led our benchmarks in accuracy even in its first release. SNP error rate is about 10% improved. The Indel error rate is improved more substantially, at a 40% reduction. Also, the greatest impact occurs with the addition of the PCR training data in v0.6. With this sort of uncontrolled experiment, we can’t conclusively say that this occurs because DeepVariant is learning something general about Indel errors from the PCR data. However, the prospect is enticing.

What Will Deep Learning Bring for the Field?

As deep learning methods begin to enter the domain of bioinformatics, it is natural to wonder how skills in the field will shift in response. Some may fear these methods will make programming expertise or domain knowledge in genomics and bioinformatics obsolete. In theory, the raw mechanics of training DeepVariant on these new data types, confirming the improvement, and preventing regressions can be shockingly automatic.

However, to reach this point, domain experts had to understand the concepts of PCR and exome sequencing and identify their relevance to errors in the sequencing process. They had to understand where to get valid, labeled data of sufficient quantity and quality. They had to determine the best ways to represent the underlying data in a trainable format.

The importance of domain expertise – what is sequencing and what are its nuances – will only grow. How it manifests may shift from extracting features and hand-crafting weights to identifying examples that fully capture that nuance.

We also see these methods as enabling, not deprecating, programming expertise – these frameworks depend on large, complex training and evaluation infrastructures. The Google Brain team recently released Nucleus, a framework for training genomics models, which contains converters between genomics formats like BAM and VCF and TensorFlow’s data formats. This may allow developers to tap into deep learning methods as specialized modules of a broader bioinformatics solution.

We hope that amongst the detail, this blog has communicated that deep learning is not a magic box. It remains essential for a scientist to carefully consider what the nuances of a problem are; how to rigorously evaluate performance; where blind spots in a method are; and which data will shine light on them.

For St. Jude, Advancing Cures for Pediatric Cancer Means Accelerating Genomic Discovery and Collaboration

Historically, cancer research has been slowed by an inability to make genomic data rapidly accessible to research collaborators. Last week, St. Jude Children’s Research Hospital took a big step toward solving this problem with its launch of St. Jude Cloud, an online platform that allows researchers to access the world’s largest public repository of pediatric cancer genomics data. Developed in partnership between St. Jude, Microsoft, and DNAnexus, St. Jude Cloud provides a flexible cloud platform for rapid data mining, analysis, and visualization.

St. Jude has long been a leader in advancing cures for pediatric cancer and other life-threatening diseases, and continues to develop new approaches that revolutionize the way medicine is practiced. St. Jude Cloud is the latest unique tool developed in the fight to advance cures for pediatric diseases. DNAnexus is proud to serve as the technology platform that brings together St. Jude researchers and their partners in a secure and collaborative ecosystem.

Collaboration fuels scientific advancement, and St. Jude Cloud is already doing just that. In a paper recently published in Nature, St. Jude researchers led by Jinghui Zhang, PhD, discovered mutations connected to UV damage in a B-cell leukemia. This was a surprising finding, and it led the team to ask whether other leukemia samples not included in the original study might have a similar mutational pattern. Scott Newman, PhD, used St. Jude Cloud to reproduce the original experimental findings in just a few days, whereas the original research took more than two years to complete.

Using St. Jude Cloud, Newman was able to conduct large-scale data analysis that identified the same UV-linked mutational signature in pediatric B-cell leukemia patients over four days. Discovering these additional samples further helped researchers understand the possible link between UV damage and a blood cancer, and could potentially lead to the development of new therapies. Learn more about St. Jude Cloud and its research capabilities via a Q&A with Newman featured in St. Jude Progress.

Like St. Jude Cloud, DNAnexus delivers fit-for-purpose community portals that advance scientific research through a secure and collaborative online environment that has been independently audited and certified. DNAnexus community research portals allow members to focus on discovery and innovation, removing the burden of secure data management, distribution, and analysis. Other community research portals powered by DNAnexus include the FDA’s precisionFDA platform, which advances regulatory standards for NGS-based drugs and devices, and the microbiome research platform Mosaic, which facilitates the translation of microbiome research into clinical applications.

Learn more about DNAnexus community portals and determine which use case is right for you.