DNAnexus at ASHG: Elevating Translational Informatics

We are gearing up for the annual American Society of Human Genetics (ASHG) meeting next week in San Diego, and we are especially excited to debut our new Apollo™ platform for multi-omic and clinical data science exploration, analysis, and discovery. Apollo™ provides translational researchers with a scalable cloud environment, flexible data models, and intuitive analysis and visualization tools that simplify research workflows for R&D teams globally and dramatically improve the efficiency of research organizations.

Visit DNAnexus in booth 622 to learn how to leverage the new Apollo™ to inform decision making, save time, and maximize value at each step of the drug discovery process. Stop by our booth anytime during the conference, or email us to schedule a meeting with a member of our science team.

Lunchtime Talk:

Leveraging Translational Informatics for the Advancement of Drug Discovery & Improved Clinical Outcomes

Thursday, October 18th, 12:30pm – 1:45pm
San Diego Convention Center, Upper Level, Room 30E

Join us to learn how biopharma customer MedImmune and academic medical center Baylor College of Medicine’s Human Genome Sequencing Center are leveraging massive volumes of biomedical data to gain better insights into the biological, environmental, and behavioral factors that influence health.

Speakers:

  • David Fenstermacher, PhD, Vice President, R&D & Bioinformatics, MedImmune
  • Will Salerno, PhD, Director of Genome Informatics, Baylor College of Medicine Human Genome Sequencing Center
  • Brady Davis, VP Strategy, DNAnexus

Lunch will be provided; RSVP to reserve your spot!

Activities in DNAnexus Booth 622:

Visualization Hour – Come explore GWAS data in 3-D using virtual reality!

Human brains are wired for spatial reasoning, making virtual reality a potentially powerful way for scientists to achieve an intuitive understanding of data. To test VR on genomic data, we combined two iconic visualizations in genomics, the Manhattan plot and the circos plot, into a fully immersive data exploration experience called BigTop. Come explore the GWAS circus with us and learn about other exciting visualization projects at DNAnexus!

  • Wednesday, 10/17, 12-1pm
  • Thursday, 10/18, 12-1pm
  • Friday, 10/19, 12-1pm

Translational Informatics Hour – Elevating Translational Informatics

Come demo the new DNAnexus Apollo™ and hear how biopharma customer MedImmune is using it to inform decision making, save time, and maximize value at each step of the drug discovery process.

  • Wednesday, 10/17, 2-3pm
  • Thursday, 10/18, 2-3pm
  • Friday, 10/19, 2-3pm

Meet the xVantage Group

Come meet members of the xVantage Group, a dedicated team with deep technical and scientific expertise, ready to create innovative, tailored solutions for customers and partners. From data ingestion to pipeline and software development, learn about the broad range of services xVantage provides and the customers it supports.

  • Wednesday, 10/17, 10-11am
  • Thursday, 10/18, 10-11am

Customer Talks:

Wednesday, October 17th
| Time | Title | Speaker | Location |
| --- | --- | --- | --- |
| 5:30 PM | Sequencing of whole genome, exome and transcriptome for pediatric precision oncology: Somatic variants and actionable findings from 253 patients enrolled in the Genomes for Kids study | Scott Newman, St. Jude Children’s Research Hospital | Session #25 – Integrated Variant Analysis in Cancer Genomics; Ballroom 20BC |
| 6:15 PM | Structural variation across human populations and families in more than 37,000 whole genomes | Will Salerno, Human Genome Sequencing Center, Baylor College of Medicine | Session #33 – Characterization of Structural Variation in Population Controls and Disease; Ballroom 20A |

Posters Featuring DNAnexus:

Wednesday, October 17th
2:00pm-3:00pm
| Poster Number | Title | Affiliation |
| --- | --- | --- |
| PgmNr 1761 | Variant identification from whole genome sequencing at the UPMC Genome Center | UPMC Genome Center |
3:00pm-4:00pm
| Poster Number | Title | Affiliation |
| --- | --- | --- |
| PgmNr 1998 | Hi-C-based characterization of the landscape of physically interacting regions and interaction mechanisms across six human cell lines using HiPPIE2 | University of Pennsylvania; DNAnexus |
| PgmNr 1554 | A high-quality benchmark dataset of SV calls from multiple technologies | Illumina; Baylor College of Medicine |
| PgmNr 3186 | Identification of novel structural variations affecting common and complex disease risks with >16,000 whole genome sequences from ARIC and HCHS/SOL | Human Genetics Center, University of Texas Health Science Center; Baylor College of Medicine HGSC; Albert Einstein College of Medicine; DNAnexus; Fred Hutchinson Cancer Research Center; University of Washington; Johns Hopkins University |

Thursday, October 18th, 3:00pm-4:00pm
| Poster Number | Title | Affiliation |
| --- | --- | --- |
| PgmNr 1222 | Novel mutations of SCN9A gene in patient with congenital insensitivity to pain identified by whole genome sequencing | Intermountain Healthcare |
| PgmNr 1648 | How well can you detect structural variants: Towards a standard framework to benchmark human structural variation | NIST; NHGRI; Genome in a Bottle Consortium; Baylor College of Medicine HGSC; PacBio; Spiral Genetics; NCBI; BioNano Genomics; 10x Genomics; Max Planck Institute; USC; Boston University Medical School; DNAnexus; Joint Initiative for Metrology in Biology |

Friday, October 19th, 2:00pm-3:00pm
| Poster Number | Title | Affiliation |
| --- | --- | --- |
| PgmNr 1439 | Are we close to constructing a fully diploid view of the human genome? | DNAnexus |
| PgmNr 1505 | How well can we create phased, diploid, human genomes? An assessment of FALCON-Unzip phasing using a human trio | DNAnexus |
| PgmNr 2732 | A population genetics approach to discover genome-wide saturation of structural variants from 22,600 human genomes | Center for Integrative Bioinformatics Vienna, University of Vienna; Baylor College of Medicine HGSC |

Announcing the Results of the Mosaic Clinical Strain Detection Challenge

Gabriel Al-Ghalith¹, Sam Westreich², Michalis Hadjithomas²

¹Janssen Human Microbiome Institute, Janssen Research & Development, LLC; ²DNAnexus, Inc.

Mosaic Community Challenge: Clinical Strain Detection is the second in a series of three challenges on Mosaic sponsored by Janssen Research & Development, LLC through the Janssen Human Microbiome Institute designed to foster collaboration and advance microbiome research. In order to develop microbial-based products, it is critical to accurately detect microbes at the strain level. This challenge called upon the community to advance accurate strain-level profiling by determining whether or not known bacterial organisms were present in microbiome samples.

We recently announced that Richa Agarwala and Sergey Shiryev from the National Center for Biotechnology Information (NCBI) and a team from One Codex scored the highest in the challenge! Here we discuss the results and lessons learned in more detail.

Challenge Results

The challenge data consisted of four metagenomes generated from human fecal samples and a genome dataset containing 40 genome assemblies. Participants were tasked with determining which of these 40 known organisms, whose genomes were provided, were present in each of the four microbiome samples. There were a total of 54 submissions for the challenge. Below are the final standings based on the best submission from each of the 14 participants (in case of ties, earlier submissions are placed higher):

| Ranking | Submitted As | Adjusted Rand Index (Score) | Accuracy | Submitted On |
| --- | --- | --- | --- | --- |
| 1 | Richa Agarwala | 96.83% | 97.5% | 22-Jun-18 |
| 2 | One Codex | 96.83% | 97.5% | 03-Aug-18 |
| 3 | Anonymous | 90.60% | 92.5% | 03-Aug-18 |
| 4 | Andreu Paytuvi | 90.41% | 92.5% | 25-Jul-18 |
| 5 | Varun Aggarwala | 90.41% | 92.5% | 02-Aug-18 |
| 6 | Ardigen R&D | 87.67% | 90.0% | 03-Aug-18 |
| 7 | Krzysztof Odrzywolek (Ardigen) | 87.15% | 90.0% | 05-Jul-18 |
| 8 | Scott Sherrill-Mix | 77.19% | 82.5% | 25-Jul-18 |
| 9 | Jeremy | 77.19% | 82.5% | 31-Jul-18 |
| 10 | Caner Bagci | 70.35% | 77.5% | 25-Jul-18 |
| 11 | Gabe Al-Ghalith (BURST) | 66.86% | 75.0% | 03-Aug-18 |
| 12 | Anonymous | 36.64% | 55.0% | 27-Jul-18 |
| 13 | Anonymous | 17.42% | 32.5% | 03-Aug-18 |
| 14 | Bert Gold | 6.03% | 27.5% | 03-Aug-18 |
| NR | Gabe Al-Ghalith (SDP) | 96.83% | 97.5% | N/A |

Teams from NCBI and One Codex tied for first place with an Adjusted Rand Index score of 96.83%; both correctly assigned 39 of the 40 organisms. Although the two teams submitted identical answers, they took entirely different approaches to the challenge: the NCBI team used a read-alignment approach, while the One Codex team used a k-mer approach. Read about their approaches below, and hear directly from the participants in a live webinar on October 10th (details below).
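For intuition, here is a minimal sketch of how a leaderboard like the one above could be scored. It assumes each submission assigns every genome either to the sample it was detected in or to a label of 0 for “absent”; the function and label scheme are illustrative, not the official Mosaic scoring code.

```python
# Illustrative scoring sketch (not the official Mosaic scoring code).
# Assumption: each of the 40 genomes gets one label per submission, either
# the sample it was detected in (1-4) or 0 if it was called absent.
from sklearn.metrics import accuracy_score, adjusted_rand_score

def score_submission(truth_labels, submitted_labels):
    """Both arguments hold one label per genome, in the same genome order."""
    return {
        "adjusted_rand_index": adjusted_rand_score(truth_labels, submitted_labels),
        "accuracy": accuracy_score(truth_labels, submitted_labels),
    }

# Toy example with eight genomes; the last assignment is wrong.
truth     = [1, 1, 2, 2, 3, 4, 0, 0]
submitted = [1, 1, 2, 2, 3, 4, 0, 3]
print(score_submission(truth, submitted))
```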

Overall, participants performed very well in terms of precision; the major difference between submissions was recall.

Precision and Recall

More than half of the 54 submissions achieved an accuracy higher than 75%, correctly assigning at least 30 of the 40 organisms. Additionally, more than 50% of submissions achieved a true positive rate higher than 75%, meaning more than 20 of the 26 non-decoy organisms were correctly assigned. The histogram below shows the distribution of true positives across the 54 challenge submissions. All participants failed to correctly assign one organism, Lachnospiraceae_C_2.

Submissions

Interestingly, the participants generally achieved high precision. As we explore below, this result indicates that the high-scoring participants used more refined approaches than naïve short-read alignment. Below are overviews of the approaches used by the two teams with the best-performing submissions (accuracy of 39/40). Both teams will present their approaches in more detail at our Mosaic webinar on October 10th.

The NCBI Team Approach

The NCBI team used the SRPRISM alignment software, which provides explicit guarantees about its output. The software reports all alignments, up to 250 alignments of the same rank, where rank is determined by the number of errors. If reads are paired and a paired alignment within the specified insert range is found, only paired alignments are reported.

The second step in the NCBI team’s approach was to take the SRPRISM alignments and judge whether there was sufficient coverage of a reference in a read set. SRPRISM was allowed a maximum of seven errors during alignment, but only error-free alignments were used for computing coverage, as the goal was strain detection rather than species detection. Because metagenomic reads usually cover only part of a reference, each alignment was padded by 1,000 bases on each side. For instance, if a query aligned against positions 5000 to 5150 of a reference genome, the alignment would be recorded as spanning positions 4000 to 6150. A reference with at least 99% of its bases covered by at least one padded alignment was considered present. The final step was a coarse-grained manual review of coverage, done by converting the coverage to a heatmap, which resulted in changing one value of the output matrix.
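A minimal sketch of the padded-coverage rule described above (illustrative only, not the NCBI team’s code): pad each error-free alignment by 1 kb on both sides, merge the padded intervals, and call a reference present when at least 99% of its bases are covered.

```python
# Sketch of padded coverage: pad alignments, merge intervals, compute the
# fraction of the reference covered, and apply a 99% presence cutoff.
def padded_coverage_fraction(alignments, ref_length, pad=1000):
    """alignments: iterable of (start, end) positions on one reference."""
    padded = sorted((max(0, start - pad), min(ref_length, end + pad))
                    for start, end in alignments)
    covered = 0
    cur_start = cur_end = None
    for start, end in padded:
        if cur_end is None or start > cur_end:   # begin a new merged block
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # extend the current block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / ref_length

def reference_present(alignments, ref_length, min_fraction=0.99):
    return padded_coverage_fraction(alignments, ref_length) >= min_fraction

# One 150 bp alignment at 5000-5150 is counted as spanning 4000-6150.
print(padded_coverage_fraction([(5000, 5150)], ref_length=10_000))  # 0.215
```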

The One Codex Team Approach

One Codex used the upcoming release of its core metagenomic classification software. At a high level, One Codex’s approach is a two-stage process: it first breaks individual next-generation sequencing (NGS) reads down into short, k-length nucleotide strings (or “k-mers”) and classifies those k-mers against its proprietary, curated database; it then analyzes the distribution of k-mers observed in the sample versus the available reference genomes. The upcoming version of the One Codex pipeline demonstrated here uses MinHash sketches to improve performance and scalability.
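One Codex’s database and pipeline are proprietary, so the sketch below is only a conceptual toy illustrating the MinHash idea: hash a reference genome’s k-mers, keep the smallest hashes as a fixed-size sketch, and estimate how much of that sketch appears among the sample’s read k-mers.

```python
# Conceptual MinHash toy (not One Codex's pipeline): estimate how much of a
# reference genome's k-mer content is contained in a metagenomic read set.
import hashlib

def kmer_hashes(seq, k=21):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    return {int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                           digest_size=8).digest(), "big")
            for i in range(len(seq) - k + 1)}

def bottom_sketch(hashes, size=1000):
    """Keep the `size` smallest hashes as a fixed-size MinHash sketch."""
    return set(sorted(hashes)[:size])

def containment(reference_seq, sample_reads, k=21, sketch_size=1000):
    """Fraction of the reference's sketch hashes found among sample k-mers."""
    ref_sketch = bottom_sketch(kmer_hashes(reference_seq, k), sketch_size)
    if not ref_sketch:
        return 0.0
    sample_hashes = set()
    for read in sample_reads:
        sample_hashes |= kmer_hashes(read, k)
    return len(ref_sketch & sample_hashes) / len(ref_sketch)
```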

Hear directly from the winners about their methods during our free webinar on October 10th at 10:00 a.m. PT (1:00 p.m. ET). Register today!

Learnings from the Challenge

A simple but standard approach to identifying which genomes are present in a community is to perform short-read alignment against the database of genomes provided by the challenge and aggregate the alignment results by the fraction of each genome covered; higher coverage of a genome indicates a higher likelihood that the genome is present in a given sample. To explore how the winning approaches improve on this baseline, we attempted a simple alignment-and-coverage approach ourselves, comparing two aligners, minimap2 and BURST, to calculate the fraction of each reference genome covered by the metagenomic reads in each sample. The BURST approach used a run mode that considers all possible alignments meeting or exceeding a specified sequence similarity threshold (in this case, 99%) and then computes a simple coverage estimate as the percentage of the genome covered. In this mode, BURST explicitly guarantees recovering all end-to-end alignments up to the similarity threshold, regardless of the number of read errors or the number of alignments, using an exact (non-heuristic) alignment approach.

Below are the genome fractions aligned for each reference genome using these two approaches:

Genome Fraction Aligned

The table above compares the genome fractions achieved using these two alignment approaches.

In the plot below, the coverage of each reference genome by reads from every sample (light grey) is compared with the coverage by reads from the genome’s microbiome of origin, i.e., the positive truths (red).

Genome Fraction Minimap

In general, the BURST approach achieved higher genome coverage. Most importantly, the BURST approach admitted a minimum genome-fraction threshold (between 0.87 and 0.93) that could be used to correctly assign 17 of the 26 non-decoy organisms without any false positives. The equivalent threshold for the minimap2 approach lies between 0.92 and 0.98 and would correctly identify only 11 of the 26 positive-truth organisms before false positive calls would be made.
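The threshold comparison above can be reproduced with a few lines of code, assuming a table of per-(organism, sample) coverage fractions and the known truth assignments are available (both hypothetical inputs in this sketch): rank all pairs by coverage and count how many true assignments are recovered before the first false positive appears.

```python
# Sketch of the threshold comparison (illustrative): how many true
# organism-to-sample assignments does a simple coverage cutoff recover
# before it admits its first false positive?
def true_calls_before_first_false_positive(coverage, truth):
    """coverage: dict {(organism, sample): genome fraction covered};
    truth: set of (organism, sample) pairs that are genuinely present."""
    ranked = sorted(coverage.items(), key=lambda item: item[1], reverse=True)
    true_positives = 0
    for (organism, sample), fraction in ranked:
        if (organism, sample) in truth:
            true_positives += 1
        else:
            # Any threshold at or below this fraction would admit a false
            # positive, so this fraction is the floor for a clean cutoff.
            return true_positives, fraction
    return true_positives, None
```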

These results show that although calculating the genome fraction covered by the metagenomic reads is important for determining whether an organism is present, it is not sufficient on its own. Nine of the 26 non-decoy organisms had a genome fraction aligned below 70%, and six were below 40%. Furthermore, there were eight cases in which reads from a decoy sample (i.e., a microbiome that was not the source of the organism) covered more of the reference genome than the reads from the source sample did (see table below).

Microbiomes

For example, Bifiidobacterium_C_1 had a higher fraction of its genome aligned by reads from Samples 3 and 4 than by reads from Sample 1, the microbiome from which the organism was isolated (black borders in the table). A likely explanation is that Samples 3 and 4 contain strains more closely related to Bifiidobacterium_C_1, leading to more reads being mapped to its reference genome.

A key difference from the winning NCBI method described above is that our simple approach used no “padding” around the aligned regions, whereas the NCBI team padded each alignment by 1 kb.

As expected, participants performed well at correctly identifying the source of non-decoy strains when there were enough reads to cover more than 90% of the organism’s genome. The table below shows the percentage of submissions making each possible organism-to-sample assignment; the truths for non-decoy organisms are outlined with a border, and decoy organisms and samples are shown in red. All participants failed to correctly assign one organism, Lachnospiraceae_C_2, whose genome was covered only 11% by reads from its source microbiome (Sample 2).

Microbiome Results

The following graph compares the call rate for each organism with the genome fraction aligned by BURST.  More than 75% of the submissions correctly identified the source of non-decoy genomes when more than 90% of their genome was covered by the reads of the source microbiome.

Call Rate

To hear more directly from the winners, register today for our free webinar on October 10th at 10:00 a.m. PT (1:00 p.m. ET)!

Parliament2: Fast Structural Variant Calling Using Optimized Combinations of Callers

Samantha Zarate

This week, we and our collaborators at the Human Genome Sequencing Center at Baylor College of Medicine uploaded a preprint for Parliament2, an optimized method that deploys multiple structural variant callers and combines their outputs into a single consensus set of structural variants. The code for Parliament2 is open source on GitHub and is available as a Docker container and as an app on DNAnexus. In this blog post, we explore the reasons to run Parliament2 and add structural variant calling to your genomic analyses.

Introduction

Structural variants are large (50 bp or greater) rearrangements in a genome. While SNPs and indels are much smaller than an individual Illumina read (~150 bp) and can be observed directly, structural variants are large and complex enough that mapping reads in and around them is difficult. Thus, structural variants must be inferred indirectly from the “shadow” they cast on the genome.

Structural variants come in a number of categories: deletions, insertions, inversions, translocations, duplications, mobile elements, and repeat expansions/contractions. The signals used to detect them are similarly diverse, including read depth, split-read mapping, read-pair orientation, insert-size distribution, and soft-clipping of sequences. These signals can vary strongly depending on the conditions and design of the sequencing experiment.

Method authors exploit different combinations of these signals to find structural variants; Breakdancer, Breakseq2, CNVnator, Delly, Lumpy, and Manta are among the many available methods. We developed Parliament2 to let a user quickly and efficiently run multiple methods in a single execution and combine their outputs in a manner that maximizes discovery power, taking about 3 hours of wall-clock time and around 60 core-hours to run.

Parliament2 represents a re-engineering of the first version of Parliament, which was developed in a collaboration between the HGSC and DNAnexus. The first version of Parliament could incorporate both long-read and short-read data and used assembly to validate events. Parliament2 currently only works on Illumina data and has been designed to achieve significant improvements in speed and efficiency, allowing its application at the scale of 100,000+ WGS cohorts. Parliament1 remains available in the DNAnexus app library.

The Parliament2 manuscript goes into significant detail, so in this post we summarize the key scientific and logistical advantages of Parliament2.

Parliament2 Runs Multiple Structural Variant Callers, Combines Them in a Standard Format, and Genotypes Them

Parliament2 Tools

The figure to the right shows a schematic for the data flow and tools used in Parliament2. Starting from a BAM or CRAM and its associated reference genome file, Parliament2 runs any combination of the callers Breakdancer, Breakseq2, CNVnator, Delly, Lumpy, and Manta as specified by the user.

Calls are integrated with SURVIVOR. This step combines events discovered by multiple methods while preserving which methods support each event. The merged calls are then genotyped with SVTyper, which confirms events with an orthogonal computational method, resulting in higher precision and a genotype call.
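For intuition about what the merge step produces, here is a conceptual sketch (Parliament2 itself delegates this to SURVIVOR; this is not SURVIVOR’s algorithm): group calls of the same SV type whose breakpoints fall within a merge distance and record every caller that supports the merged event.

```python
# Conceptual consensus-merge sketch (illustrative; Parliament2 uses SURVIVOR
# for this step). Calls of the same type on the same chromosome whose start
# positions lie within `max_dist` are collapsed into one event that records
# all of its supporting callers.
def merge_sv_calls(calls, max_dist=1000):
    """calls: list of dicts like {"chrom": "chr1", "start": 100200,
    "end": 105900, "svtype": "DEL", "caller": "manta"}."""
    merged = []
    key = lambda c: (c["chrom"], c["svtype"], c["start"])
    for call in sorted(calls, key=key):
        last = merged[-1] if merged else None
        if (last is not None
                and last["chrom"] == call["chrom"]
                and last["svtype"] == call["svtype"]
                and abs(call["start"] - last["start"]) <= max_dist):
            last["callers"].add(call["caller"])        # same event, new support
            last["end"] = max(last["end"], call["end"])
        else:
            merged.append({**call, "callers": {call["caller"]}})
    return merged
```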

Parliament2 is available as a Docker image that comes pre-installed with all of the required dependencies for each program (important, since some of the programs have mutual incompatibilities in libraries such as pysam and therefore require their own environments). Parliament2 can be executed from any environment with a docker pull -> docker run command, and users can select which set of programs to run. For example, if you only want to run CNVnator, Parliament2 provides an easy way to do so.

Parliament2 Uses Multiple Parallelization Strategies to be Faster and More Cost-Efficient

Parliament2 Chart

Breakdancer, CNVnator, Delly, and Lumpy do not multi-thread effectively (though Lumpy has been reworked as smoove for this purpose). By managing parallel execution across chromosomes and merging the results, Parliament2 achieves significant speed increases compared to the most naive use of these tools.

This improvement speeds up the individual tools, but the larger advantage in compute resources comes from parallel execution. Each of the tools has different requirements for CPU, disk I/O, and RAM, so running multiple methods at the same time smooths the overall resource draw.
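A rough illustration of this scheduling idea (a simplification, not Parliament2’s actual implementation): fan out (caller, chromosome) tasks across a process pool so that tools with different resource profiles run side by side. The command templates here are hypothetical placeholders; real callers differ in how a region is specified.

```python
# Simplified scheduling sketch (not Parliament2's implementation): run each
# single-threaded caller on each chromosome as an independent task so the
# whole machine stays busy. Command templates are hypothetical placeholders.
from concurrent.futures import ProcessPoolExecutor
import subprocess

def run_caller_on_chrom(caller_cmd, bam, chrom):
    """Run one caller on one chromosome; caller_cmd is assumed to accept
    {bam} and {chrom} placeholders."""
    subprocess.run(caller_cmd.format(bam=bam, chrom=chrom),
                   shell=True, check=True)
    return caller_cmd, chrom

def run_all(caller_cmds, bam, chromosomes, workers=16):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_caller_on_chrom, cmd, bam, chrom)
                   for cmd in caller_cmds for chrom in chromosomes]
        return [f.result() for f in futures]
```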

CPU Utilization

Chart (A) to the right shows the percent core utilization achieved on a 16-core machine when running each of the methods individually. Manta and Breakseq both finish quickly, while the other methods complete over the course of 2-3 hours, each using only a fraction of total machine resources. This generally leads to wasted resources, as you typically pay for a full machine’s worth of time regardless of how much of it you use.

CPU Utilization B

The chart to the left shows the CPU utilization of Parliament2 when running various combinations of callers. Overall, the machine is used far more efficiently when multiple methods run concurrently: starting from a run of Breakdancer, Manta, and CNVnator, adding Lumpy, Delly, and Breakseq costs only about 10 more minutes of job time, so they come almost “for free”.

Precision and Recall of Parliament2 (and Individual Methods)

In the remaining sections, we analyze the precision and recall of Parliament2 as well as of the individual tools that compose it. These comparisons are based on the Genome in a Bottle structural variant v0.6 truth set, evaluated with the Truvari tool by Adam English of Spiral Genetics. Because the SV discovery power of short reads is weak for insertions in all methods (maximum recall of 17%), we mostly present the deletion benchmarks here.

Parliament2 Achieves High Precision with Combinations of Calls

Precision Chart

The plot above is an UpSet plot, which shows how the combination of callers supporting an event reflects the probability that the call is correct. A filled black circle indicates that a tool called the event; for example, in the column where Breakdancer, Manta, and Breakseq each have a filled black circle, at least these methods made the call. The chart indicates that Manta achieves the best precision of the individual methods at 90%, while calls supported by multiple methods can reach very high precision (99%-100%).

Precision Deletion Calls

The figure above shows a breakdown of all calls made by Parliament2 by their category of support, giving a sense of how many calls are made at each confidence level. Any category below 50% precision is flagged as LowQual in the VCF FILTER field.

Parliament2 Gives Confidence Scores Based on Size and Event Type

In addition to filtering calls in the VCF, the same concept allows us to assign a Phred-encoded quality value to each call based on the size and type of the event, letting downstream analysts create their own filtering criteria based on quality.

Parliament Quality Values
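Phred encoding maps an error probability onto a logarithmic scale. Assuming the empirical precision of each (caller combination, size, type) category is used as the probability that a call is correct, the conversion would be:

```latex
% Assumption: quality derived from the empirical precision of the category.
Q = -10 \,\log_{10}\left(1 - \mathrm{precision}\right)
% e.g. precision = 0.90 gives Q = 10; precision = 0.99 gives Q = 20
```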

Parliament2 Discovers More Events by Combining the Strengths of Multiple Methods

Parliament Recall

By running more tools, Parliament2 can find events that individual methods might miss. In particular, this is driven by differences in discovery power between different size ranges.

Only certain methods call deletions less than 300bp, and among those that do so, Manta demonstrates by far the highest recall at 55%.

However, for events larger than 1kbp, Manta is the fourth-weakest method.

In general, smaller SV events are more common in humans, meaning that the aggregate accuracy numbers are disproportionately influenced by small events.

Having additional callers to supplement the strong overall performance of Manta allows for additional discovery power, particularly for larger events. This can be especially important since large events disrupt greater fractions of the genome and may be disproportionately more likely to drive genetic disease.

In Conclusion

It is natural to fixate on the question of who has “the best” method. The Parliament methods (1 & 2) are not about writing better individual tools; the concept relies on finding the right way to combine many methods so that the whole is more than the sum of its parts.

Parliament illustrates a fact that can seem counter-intuitive: even in the extreme situation where tool A discovers every event that tool B finds, and with fewer false positives, you can still do better by running both, because you learn which events are supported by both tools as opposed to only one.

Many scientists have brought an enormous range of opinions, effort, and creativity to bear in the field of calling structural variation. Parliament is a very specific and tangible manifestation of a statement which is even more true for science as a whole: even the most experienced can gain from the wisdom of others. And we are stronger working together than we are apart.

By fortunate chance, Samantha was presenting her work at 23andMe Genome Day when Tony Blair happened to be visiting. Samantha had the opportunity to present Parliament2 to the former UK Prime Minister. When learning about how Parliament enables the combination of many opinions for a greater outcome, Mr. Blair said something to the effect of “I hope that this is so”.

Tony Blair