Comparison of Somatic Variant Calling Pipelines On DNAnexus

The detection of somatic mutations in sequenced cancer samples has become increasingly standard in research and clinical settings, as these mutations provide insights into genomic regions that can be targeted by precision medicine therapies. Due to the heterogeneity of tumors, somatic variant calling is challenging, especially for variants at low allele frequencies. Researchers commonly use somatic variant calling tools, including MuTect, MuSE, Strelka, and Somatic Sniper, which detect somatic mutations by conducting paired comparisons between sequenced normal and tumor tissue samples. These variant callers differ in their algorithms, filtering strategies, recommendations, and output. We therefore set out to compare how the individual apps perform on the DNAnexus Platform. Each app was evaluated for recall and precision, cost, and time to complete.

To benchmark some of the common somatic variant calling tools available on the DNAnexus Platform, our team of scientists simulated synthetic cancer datasets at varying sequencing depths. DNA samples from the European Nucleotide Archive were obtained and mapped to the hs37d5 reference with the BWA-mem FASTQ read mapper on DNAnexus.

These samples were then merged into a single BAM file representing the normal sample. To obtain the tumor sample, synthetic variants were inserted into each individual sample with the BAMSurgeon app on DNAnexus. All simulated samples were then merged into one BAM file constituting the tumor sample. Both the synthetic tumor and normal BAM files had approximately 250X sequencing depth.

The synthetic tumor BAM file was then downsampled to a range of sequencing depths. Using sambamba through the Swiss Army Knife application, it was reduced to 5X, 10X, 15X, 20X, 30X, 40X, 50X, 60X, 90X, and 120X coverage files. The file representing the normal sample was downsampled to 30X sequencing depth. Once the synthetic cancer dataset was created, the common somatic variant calling tools MuTect, MuSE, Strelka, and Somatic Sniper were run to detect single nucleotide variants. Upon completion, the high-quality variants were filtered from each VCF.
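The downsampling step can be sketched as simple arithmetic: to go from roughly 250X to a target depth, keep approximately target/250 of the reads. The snippet below is only an illustration under that assumption; the function and variable names are hypothetical, and the exact sambamba invocation used in the study is not stated here.

```python
# Sketch: fraction of reads to keep when downsampling a ~250X BAM.
# Assumes depth scales linearly with the fraction of reads retained.
SOURCE_DEPTH = 250.0  # approximate depth of the merged BAM files (from the text)

def subsample_fraction(target_depth, source_depth=SOURCE_DEPTH):
    """Return the fraction of reads to keep to reach target_depth on average."""
    if not 0 < target_depth <= source_depth:
        raise ValueError("target depth must be in (0, source depth]")
    return target_depth / source_depth

TUMOR_TARGETS = [5, 10, 15, 20, 30, 40, 50, 60, 90, 120]  # depths from the text
NORMAL_TARGET = 30

tumor_fractions = {d: round(subsample_fraction(d), 3) for d in TUMOR_TARGETS}
normal_fraction = round(subsample_fraction(NORMAL_TARGET), 3)

for depth, fraction in tumor_fractions.items():
    print(f"tumor {depth}X: keep fraction {fraction}")
print(f"normal {NORMAL_TARGET}X: keep fraction {normal_fraction}")
```

Each fraction is the kind of value a read-subsampling tool would take as input; because subsampling is random, the resulting depth is approximate rather than exact.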



MuTect performed the best at classifying correct variants, followed by Strelka, MuSE, and Somatic Sniper. This was consistent across allele frequency thresholds of 0.1, 0.2, 0.3, 0.4, and 0.5.

Coverage and Recall

One interesting finding: the callers investigated showed a similar pattern in their ability to recall variants at lower allele frequencies. Each caller discovers more of the variants as coverage increases, before plateauing at a recall ceiling at a certain coverage. Lower allele frequencies require more coverage before recall saturates for a given caller. 30-fold coverage was required to reach the plateau for 0.5 allele frequency variants, while 40-fold coverage was required for 0.1 allele frequency variants. Reliable detection of variants at still lower frequencies presumably requires even more coverage to reach a recall plateau.


All tools performed well at identifying relevant variants (>95% precision) regardless of tumor sequencing depth.
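For readers less familiar with the metrics used in this comparison, here is a minimal sketch of how precision, recall, and their harmonic mean (F-score) can be computed from a set of called variants and a truth set. The variant keys and toy data are hypothetical and are not results from this study; the evaluation code actually used by the authors is not shown in this post.

```python
# Sketch: precision, recall, and F-score for a call set versus a truth set.
# Variants are represented as hashable keys, e.g. "chrom:pos:ref>alt".

def evaluate_calls(called, truth):
    """Compare called variants against the truth set and return metrics."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # called and present in truth
    false_pos = len(called - truth)   # called but absent from truth
    false_neg = len(truth - called)   # in truth but not called
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}

# Hypothetical toy data, for illustration only.
truth = {"1:1000:A>T", "1:2000:G>C", "2:500:C>A", "3:42:T>G"}
called = {"1:1000:A>T", "1:2000:G>C", "2:500:C>A", "2:900:G>T"}

metrics = evaluate_calls(called, truth)
print(metrics)
```

Precision rewards callers that report few false positives, while recall rewards callers that miss few true variants; the F-score balances the two in a single number.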

To get a more accurate view of the interplay between precision and recall, the harmonic mean of precision and recall (F-score) was computed for each output VCF by depth. MuTect had the best performance overall, followed by Strelka, then MuSE, and then Somatic Sniper.

Runtime & Cost

Out of all the apps, Strelka finished most rapidly for the lowest cost. Compared to MuTect, Strelka did not score as high for precision or recall, but completed the analysis of single nucleotide variants in a fraction of the time.

To get a more detailed comparison between MuTect and Strelka, this three-way Venn diagram compares these tools to the truth set. Note that the false negatives called by MuTect are likely due to noise in the dataset.

To better visualize the differences between the callers, we converted the output of each caller into a high-dimensional vector in which each variant call in any of the samples is one dimension. This format allows us to calculate the distances between each of the programs and between each program and the truth set. It also allows us to use standard methods such as Multidimensional Scaling to convert these distances into positions in 2-D space (axis units are arbitrary; only relative positions matter in the graph below).
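The vector representation described above can be sketched as follows: every variant seen in any call set becomes one dimension, each caller becomes a 0/1 vector, and pairwise distances are computed between vectors. The distance metric used in the original analysis is not stated, so Hamming distance is an assumption here, and the call sets are hypothetical toy data.

```python
# Sketch: turn call sets into binary vectors and compute pairwise distances.
# Hamming distance is an assumed metric; the toy call sets are hypothetical.

def to_vectors(call_sets):
    """Map each call set to a 0/1 vector over the union of all variants."""
    dimensions = sorted(set().union(*call_sets.values()))
    vectors = {
        name: [1 if variant in calls else 0 for variant in dimensions]
        for name, calls in call_sets.items()
    }
    return dimensions, vectors

def hamming(a, b):
    """Count the dimensions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(vectors):
    """Return a dict of pairwise Hamming distances keyed by (name, name)."""
    names = list(vectors)
    return {(a, b): hamming(vectors[a], vectors[b]) for a in names for b in names}

call_sets = {
    "truth": {"v1", "v2", "v3", "v4"},
    "caller_a": {"v1", "v2", "v3", "v5"},
    "caller_b": {"v1", "v2", "v5", "v6"},
}

dimensions, vectors = to_vectors(call_sets)
distances = distance_matrix(vectors)
for pair, dist in sorted(distances.items()):
    print(pair, dist)
```

A distance matrix like this one is the usual input to a Multidimensional Scaling routine, which would then place each caller and the truth set as a point in 2-D space.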

Valid variant calling results are crucial as next-generation sequencing data is increasingly applied to the development of targeted cancer therapeutics. Our analysis of MuTect, MuSE, Strelka, and Somatic Sniper found that the best results with respect to precision and recall can be achieved by using MuTect. Strelka was also a top performer, and simultaneously reduced runtime and cost.

Need to detect variants in your dataset? Get started using these tools on DNAnexus today.

This research was performed by Nicholas Hill and Victoria Wang as part of their internship with DNAnexus. The project was supervised by Naina Thangaraj, Arkarachai Fungtammasan, Yih-Chii Hwang, Steve Osazuwa, and Andrew Carroll.

Removing the NGS Analytics Data Bottleneck with Field-Programmable Gate Arrays (FPGAs)

Edico Genome’s FPGA-backed DRAGEN Bio-IT Platform Now Available on DNAnexus

The following is a guest blog, written by our partners at Edico Genome.

With rapid adoption across a variety of practices, next-generation sequencing (NGS) is on track to become one of the largest producers of big data by 2025. While the integration of NGS poses exceptional breakthroughs in its applied practices, one major problem threatens its expansion: a lack of computing power to analyze the rapidly growing body of data.

Current projections have genomic data continuing to double every seven months, a stark acceleration in comparison to Moore’s Law, which states CPU capabilities will double every two years. The gap between the two growth rates creates a bottleneck for genomics labs.
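To make the size of this gap concrete, the short calculation below compares the two doubling rates quoted above. It is only a back-of-the-envelope sketch that assumes both trends hold steadily over the period; the function name is illustrative.

```python
# Sketch: compare growth under a 7-month doubling (data) and a
# 24-month doubling (compute), assuming both rates stay constant.

def growth_factor(months, doubling_months):
    """Multiplicative growth after `months` given a fixed doubling period."""
    return 2 ** (months / doubling_months)

DATA_DOUBLING_MONTHS = 7      # genomic data, per the projection quoted above
COMPUTE_DOUBLING_MONTHS = 24  # Moore's Law, per the text

for years in (1, 2, 5):
    months = 12 * years
    data = growth_factor(months, DATA_DOUBLING_MONTHS)
    compute = growth_factor(months, COMPUTE_DOUBLING_MONTHS)
    print(f"{years} yr: data x{data:.1f}, compute x{compute:.1f}, gap x{data / compute:.1f}")

gap_after_two_years = growth_factor(24, 7) / growth_factor(24, 24)
```

Under these assumptions, data volume grows roughly five times faster than CPU capability over a two-year span, and the gap widens every year after that.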

Designed to uncork this big data bottleneck, Edico Genome’s DRAGEN™ (Dynamic Read Analysis for Genomics) Platform leverages FPGA (Field-Programmable Gate Array) technology to provide customers with hardware-accelerated implementation of genome pipeline algorithms. Leveraging FPGAs, DRAGEN allows customers to analyze NGS data at unprecedented speeds with extremely high accuracy and unwavering dependability.

Uncorking the big data bottleneck with DRAGEN

In contrast to conventional CPU-based systems, which must execute lines of software code to perform an algorithmic function, FPGAs implement algorithms as logic circuits, providing an output almost instantaneously. By replicating these logic circuits thousands of times over, DRAGEN is able to achieve industry-leading speeds by allowing for massive parallelism, unlike CPUs, which are limited to running only one task per core. FPGAs are also fully reconfigurable, enabling customers to switch between functions and pipelines within seconds.

As a result, DRAGEN delivers high accuracy while functioning with industry-leading speed, efficiency, and parallelism. DRAGEN can process an entire human genome at 30x coverage in about 90 minutes, as compared to over 30 hours using a traditional CPU-based system, saving customers time and money. DRAGEN’s Genome Pipeline is now available on DNAnexus at a reduced trial rate until October 31, 2017. To sign up for exclusive promotional pricing, visit: .

At Bio-IT World: Promoting Technological Innovation to Advance Precision Medicine

We are excited to join the 3,000+ researchers, clinicians, and pharmaceutical and IT professionals attending the Bio-IT World Conference in Boston next week. The DNAnexus team will be onsite and headquartered in booth #316. Please stop by to learn how DNAnexus helps improve secondary analysis, facilitates collaboration, and provides a scalable and secure platform for genomic research. Register here to attend the conference.

A highlight of the event every year is the Bio-IT World Best Practices Awards. This prestigious award highlights outstanding examples of how technological innovation can be a powerful force for change in the life sciences. This year, our partner M2Gen is a Best Practices Award finalist!

M2Gen has partnered with 15 of the nation’s leading cancer centers via the Oncology Research Information Exchange Network (ORIEN) to deliver informatics-based solutions to accelerate therapy discovery and development. The DNAnexus cloud platform supports molecular data access, management, collaboration, and analysis for ORIEN. This cloud-based approach creates value for all stakeholders, impacting the point of care and driving basic cancer research in both academic centers and industry. The winners in the multiple categories of the Best Practices Awards will be announced live during the plenary session on Thursday, May 25th.

Hongyue Dai, PhD, CSO of M2Gen, will be in the DNAnexus booth to answer questions and showcase M2Gen’s innovative cancer data network. Come by booth #316 at 10:00am ET on Thursday to learn more.

We will also be showcasing projects with our clinical and software partners. See our full list of activities below. Can’t make it to one of our events? Stop by booth 316 anytime during the conference, or email us to schedule a meeting with a member of our team.

Scaling the World’s Fastest Clinical Genomic Pipeline for Critical Care in Pediatrics
Narayanan Veeraraghavan, PhD, Director of IT, Rady Children’s Institute for Genomic Medicine
Wednesday, May 24, 10:00am-10:30am
DNAnexus Booth #316

Genomic Solutions in Microsoft Azure
Singer Ma, Scientific Operations Director, DNAnexus
Wednesday, May 24, 2:30pm-3:00pm
Microsoft Booth #529

Rapid Variant Discovery with Sentieon
Brendan Gallagher, Business Development, Sentieon
Wednesday, May 24, 3:30pm-4:00pm
DNAnexus Booth #316

Meet & Greet with M2Gen
Hongyue Dai, PhD, Chief Scientific Officer, M2Gen
Thursday, May 25, 10:00am-10:30am
DNAnexus Booth #316