Comparison of Somatic Variant Calling Pipelines On DNAnexus

The detection of somatic mutations in sequenced cancer samples has become increasingly standard in research and clinical settings, as these mutations provide insights into genomic regions that can be targeted by precision medicine therapies. Due to the heterogeneity of tumors, somatic variant calling is challenging, especially for variants at low allele frequencies. Researchers use common somatic variant calling tools, including MuTect, MuSE, Strelka, and Somatic Sniper, that detect somatic mutations by conducting paired comparisons between sequenced normal and tumor tissue samples. Each of these variant callers differs in algorithms, filtering strategies, recommendations, and output. We therefore set out to compare how these individual apps perform on the DNAnexus Platform. Each app was evaluated for recall and precision, cost, and time to complete.

To benchmark some of the common somatic variant calling tools available on the DNAnexus Platform, our team of scientists generated synthetic cancer datasets at varying sequencing depths. DNA samples were obtained from the European Nucleotide Archive and mapped to the hs37d5 reference with the BWA-MEM FASTQ Read Mapper on DNAnexus.

These samples were then merged into a single BAM file representing the normal sample. To obtain the tumor sample, synthetic variants were inserted into each individual sample with the BAMSurgeon app on DNAnexus. All simulated samples were then merged into one BAM file constituting the tumor sample. Both the synthetic tumor and normal BAM files had approximately 250X sequencing depth.

The synthetic tumor BAM file was then downsampled to a range of sequencing depths. Using sambamba through the Swiss Army Knife application, we reduced it to 5X, 10X, 15X, 20X, 30X, 40X, 50X, 60X, 90X, and 120X coverage files. The file representing the normal sample was downsampled to 30X sequencing depth. Once the synthetic cancer dataset was created, the common somatic variant calling tools MuTect, MuSE, Strelka, and Somatic Sniper were run to detect single nucleotide variants. Upon completion, each output VCF was filtered to retain the high-quality variants.
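
As a rough sketch of the arithmetic behind the downsampling step, the fraction of reads to keep is the target depth divided by the starting depth. The snippet below assumes the starting depth is exactly 250X (the text only says approximately) and that reads are sampled uniformly. The function name is illustrative, and the exact sambamba options used in the study are not stated here.

```python
# Depths taken from the text; the 250X starting depth is approximate.
SOURCE_DEPTH = 250
TUMOR_TARGETS = [5, 10, 15, 20, 30, 40, 50, 60, 90, 120]
NORMAL_TARGET = 30


def subsample_fraction(target_depth, source_depth=SOURCE_DEPTH):
    """Fraction of reads to keep so the expected depth equals target_depth."""
    if not 0 < target_depth <= source_depth:
        raise ValueError("target depth must be in (0, source depth]")
    return target_depth / source_depth


# Hypothetical helper output: one sampling fraction per target depth.
tumor_fractions = {d: round(subsample_fraction(d), 3) for d in TUMOR_TARGETS}
normal_fraction = round(subsample_fraction(NORMAL_TARGET), 3)

for depth, frac in tumor_fractions.items():
    print(f"{depth:>3}X tumor:  keep {frac:.3f} of reads")
print(f"{NORMAL_TARGET:>3}X normal: keep {normal_fraction:.3f} of reads")
```

Because the sampling is random, the realized depth of each downsampled file will vary slightly around its target.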

Results:

Recall

MuTect performed the best at identifying true variants, followed by Strelka, MuSE, and Somatic Sniper. This ranking was consistent across allele frequency thresholds of 0.1, 0.2, 0.3, 0.4, and 0.5.

Coverage and Recall

One interesting finding: for the callers investigated, the ability to recall variants at lower allele frequencies showed a similar pattern. Each caller discovers more of the variants as coverage increases, before plateauing at a recall ceiling at a certain coverage. Lower allele frequencies require more coverage before recall saturates for a given caller. 30-fold coverage was required to reach the plateau for 0.5 allele frequency variants, while 40-fold coverage was required for 0.1 allele frequency variants. Reliable detection of variants at still lower frequencies presumably requires even more coverage to reach a recall plateau.
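
One way to build intuition for why lower allele frequencies need more coverage is a simple binomial model: if each read at a site carries the variant with probability equal to the allele frequency, we can ask how likely it is to see at least a few supporting reads at a given depth. This is only a sketch under that assumption. It is not how MuTect, MuSE, Strelka, or Somatic Sniper actually score variants, the `min_alt_reads` threshold of 3 is hypothetical, and the model is not expected to reproduce the plateau depths reported above.

```python
from math import comb


def detection_probability(depth, allele_freq, min_alt_reads=3):
    """P(at least min_alt_reads supporting reads) under Binomial(depth, allele_freq)."""
    # Sum the probabilities of seeing fewer supporting reads than the threshold.
    p_fewer = sum(
        comb(depth, k) * allele_freq**k * (1 - allele_freq) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


for depth in (5, 10, 20, 30, 40, 60):
    row = "  ".join(
        f"AF={af}: {detection_probability(depth, af):.3f}" for af in (0.1, 0.3, 0.5)
    )
    print(f"{depth:>3}X  {row}")
```

In this toy model the probability climbs toward 1 at every allele frequency, but the lower the frequency, the more depth it takes to get there.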

Precision

All tools performed well at identifying relevant variants (>95% precision) regardless of tumor sequencing depth.
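
For readers who want to compute these metrics on their own call sets, here is a minimal sketch of recall, precision, and their harmonic mean. It assumes each variant can be represented as a (chromosome, position, ref, alt) tuple and that a call counts as correct only on an exact match with the truth set. The function name and toy variants are illustrative and not taken from the study.

```python
def evaluate_calls(called, truth):
    """Compare a set of called variants against a truth set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and in the truth set
    fp = len(called - truth)   # false positives: called but not in the truth set
    fn = len(truth - called)   # false negatives: in the truth set but missed
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}


# Toy data (hypothetical variants): three shared calls, one missed, one spurious.
truth = {("1", 1000, "A", "T"), ("1", 2000, "G", "C"),
         ("2", 500, "C", "T"), ("3", 750, "T", "G")}
called = {("1", 1000, "A", "T"), ("1", 2000, "G", "C"),
          ("2", 500, "C", "T"), ("4", 90, "G", "A")}

metrics = evaluate_calls(called, truth)
print(metrics)
```

With the toy data above, precision, recall, and the harmonic mean all come out to 0.75.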

To get a more accurate view of the interplay between precision and recall, the harmonic mean of precision and recall (F-score) was computed for each output VCF by depth. MuTect had the best performance overall, followed by Strelka, then MuSE and Somatic Sniper.

Runtime & Cost

Out of all the apps, Strelka finished fastest and at the lowest cost. Compared to MuTect, Strelka did not score as high for precision or recall, but it completed the analysis of single nucleotide variants in a fraction of the time.

To get a more detailed comparison between MuTect and Strelka, this three-way Venn diagram compares the two tools with the truth set. Note that MuTect's false negatives are likely due to noise in the dataset.

To better visualize the differences between the callers, we converted the output of each caller into a high-dimensional vector in which each variant call in any of the samples is one of the dimensions. This format allows us to calculate the distances between the programs, and between each program and the truth set. It also allows us to use standard methods such as Multidimensional Scaling to convert these distances into positions in 2-D space (axis units are arbitrary; only relative positions matter in the graph below).
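
The vector encoding and pairwise distances can be sketched as follows. This is a minimal illustration, not the code used for the figure: the post does not say which distance metric was used, so Hamming distance (the number of variants on which two call sets disagree) is an assumption here, and the variant identifiers and caller names are made up.

```python
def call_vectors(call_sets):
    """Map each named call set to a 0/1 vector over the union of all variants."""
    dimensions = sorted(set().union(*call_sets.values()))
    return {
        name: [1 if variant in calls else 0 for variant in dimensions]
        for name, calls in call_sets.items()
    }


def hamming(u, v):
    """Number of dimensions (variants) on which two vectors disagree."""
    return sum(a != b for a, b in zip(u, v))


def distance_matrix(vectors):
    """Pairwise distances between all named vectors, in insertion order."""
    names = list(vectors)
    matrix = [[hamming(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


# Toy call sets (hypothetical variant identifiers and caller names).
call_sets = {
    "truth": {"v1", "v2", "v3", "v4"},
    "caller_a": {"v1", "v2", "v3"},
    "caller_b": {"v1", "v2", "v5"},
}

names, matrix = distance_matrix(call_vectors(call_sets))
for name, row in zip(names, matrix):
    print(f"{name:>9}: {row}")
```

A distance matrix like this is the input that a Multidimensional Scaling implementation would take to produce 2-D coordinates. That step is left out of the sketch.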

Valid variant calling results are crucial as next-generation sequencing data is increasingly applied to the development of targeted cancer therapeutics. Our analysis of MuTect, MuSE, Strelka, and Somatic Sniper found that the best results with respect to precision and recall can be achieved by using MuTect. Strelka was also a top performer, while reducing both runtime and cost.

Need to detect variants in your dataset? Get started using these tools on DNAnexus today.

This research was performed by Nicholas Hill and Victoria Wang as part of their internship with DNAnexus. The project was supervised by Naina Thangaraj, Arkarachai Fungtammasan, Yih-Chii Hwang, Steve Osazuwa, and Andrew Carroll.

At Bio-IT World: Promoting Technological Innovation to Advance Precision Medicine

We are excited to join the 3,000+ researchers, clinicians, and pharmaceutical and IT professionals attending the Bio-IT World Conference in Boston next week. The DNAnexus team will be onsite and headquartered in booth #316. Please stop by to learn how DNAnexus helps improve secondary analysis, facilitates collaboration, and provides a scalable and secure platform for genomic research. Register here to attend the conference.

A highlight of the event every year is the Bio-IT World Best Practices Awards. This prestigious award highlights outstanding examples of how technological innovation can be a powerful force for change in the life sciences. This year, our partner M2Gen is a Best Practices Award finalist!

M2Gen has partnered with 15 of the nation’s leading cancer centers via the Oncology Research Information Exchange Network (ORIEN) to deliver informatics-based solutions to accelerate therapy discovery and development. The DNAnexus cloud platform supports molecular data access, management, collaboration, and analysis for ORIEN. This cloud-based approach creates value for all stakeholders, impacting the point-of-care and driving basic cancer research in both academic centers and industry. Winners in the multiple categories of the Best Practices Awards will be announced live during the plenary session on Thursday, May 25th.

Hongyue Dai, PhD, CSO of M2Gen, will be in the DNAnexus booth to answer questions and showcase M2Gen’s innovative cancer data network. Come by booth #316 at 10:00am ET on Thursday to learn more.

We will also be showcasing projects with our clinical and software partners. See our full list of activities below. Can’t make it to one of our events? Stop by booth 316 anytime during the conference, or email us to schedule a meeting with a member of our team.

Scaling the World’s Fastest Clinical Genomic Pipeline for Critical Care in Pediatrics
Narayanan Veeraraghavan, PhD, Director of IT, Rady Children’s Institute for Genomic Medicine
Wednesday, May 24, 10:00am-10:30am
DNAnexus Booth #316

Genomic Solutions in Microsoft Azure
Singer Ma, Scientific Operations Director, DNAnexus
Wednesday, May 24, 2:30pm-3:00pm
Microsoft Booth #529

Rapid Variant Discovery with Sentieon
Brendan Gallagher, Business Development, Sentieon
Wednesday, May 24, 3:30pm-4:00pm
DNAnexus Booth #316

Meet & Greet with M2Gen
Hongyue Dai, PhD, Chief Scientific Officer, M2Gen
Thursday, May 25, 10:00am-10:30am
DNAnexus Booth #316

Rady Children’s Quest to Find That Needle in a Haystack

Rady Children’s Institute for Genomic Medicine (RCIGM), located in San Diego, has announced a pioneering effort to deliver life-changing genetic diagnoses for children suffering from rare diseases. Led by its president and CEO, Dr. Stephen Kingsmore, RCIGM is developing an end-to-end clinical whole-genome data analysis solution on the DNAnexus Platform for children’s hospitals nationally.

The impact of diagnosis by whole-genome sequencing (WGS) is often life changing. The team routinely tests critically ill children for over 5,000 diseases, of which more than 500 have highly effective treatments. For example, if the test reveals a mutation in a gene involved in digestion that prevents the child from processing a particular nutrient and leads to the buildup of a poisonous byproduct, a simple change in diet can limit the effects of the disease. The sooner this condition can be diagnosed, the less damage the child will suffer. In these cases, minutes literally matter.

Dr. Kingsmore’s vision is to ensure genome-powered diagnosis is accessible to every child who needs it. Building a world-class pipeline at a single hospital isn’t enough. RCIGM needed a solution that could scale and be deployed at institutions around the world. DNAnexus provides the technology and expertise that allows RCIGM to grow an innovative pediatric-focused genomics network, distribute its clinical tools and collaborate with colleagues in a secure and compliant environment.

This work was done as part of RCIGM’s collaboration with the Newborn Sequencing In Genomic medicine and public HealTh (NSIGHT) program. NSIGHT addresses how genomic sequencing can replicate or augment known screening results for newborn disorders, what knowledge sequencing can provide for conditions not currently screened, and what additional clinical information could be learned from sequencing relevant to the clinical care of newborns. The NSIGHT program is funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Human Genome Research Institute (NHGRI), components of the National Institutes of Health.

DNAnexus provides a flexible platform that connects Fabric Genomics’ interpretation software and integrates seamlessly with Rady Children’s custom data interpretation portal. Users monitor jobs, organize and share data, and compare patients’ data to a diagnostic resource within the network. At DNAnexus, we are proud to support Dr. Kingsmore and RCIGM’s endeavor to prevent, diagnose, and treat childhood diseases through genomics research.