Case Study: Calling Somatic Variants with Sentieon on DNAnexus

 Editor’s Note: This blog post was written by Don Freed, Bioinformatics Scientist at Sentieon. Email him at

Sentieon on DNAnexus provides clinicians and researchers an easy-to-use interface to process genomics data with state-of-the-art tools in the cloud. In our previous post, we showed how the Sentieon Rapid DNAseq app is capable of processing an entire trio from FASTQ to VCF in about an hour. Today we’ll demonstrate the processing of targeted and exome-capture sequencing of tumor-normal paired samples.

Cancer is a genomic disease, arising as the result of sequential mutations in somatic cells. These mutations change the properties of cancer cells, leading to uncontrolled growth, cell death resistance, and eventually tumor formation and metastasis. Clinicians and researchers are very interested in using next-generation sequencing (NGS) data to dig into the genetic changes that give rise to cancer. NGS data allows researchers to better understand the process of cancer and tumor formation and allows clinicians to determine the best course of treatment for cancer patients. As a result, the sequencing of paired tumor-normal samples is a common study design.

Somatic variant calling is a difficult task due to tumor heterogeneity and low abundance of tumor cells. Two of the more popular tools to address this task are MuTect and MuTect2, developed by the Broad Institute. High coverage across the protein-coding sequences of the tumor genome is essential for the sensitive detection of low-frequency somatic mosaic mutations. In targeted sequencing experiments, coverage of the targeted regions can exceed 1,000 or even 10,000 reads. Processing such high-coverage datasets is computationally intensive, and by default MuTect (1.1.5) and MuTect2 (3.7) downsample the data to 1,000x and 500x, respectively, for improved runtime performance. Unfortunately, downsampling may also result in missed variant calls.

The Sentieon TNseq app on DNAnexus does not perform downsampling, enabling you to achieve the highest sensitivity for low-frequency variants present in your sample. TNseq is also deterministic: given the same input data, the same variants will be called. In addition, the app produces results that match MuTect and MuTect2, but with a drastically improved runtime and cost savings.

Running TNseq Tumor-Normal Analysis on DNAnexus

For this case study, we used publicly available tumor-normal paired samples from SRP079186 and whole-genome sequence data from the Texas Cancer Research Biobank (TCRB). According to the project description in SRA, the tumor samples in SRP079186 consist of sequence data from small cell carcinoma mixed with squamous cell carcinoma of the esophagus, while the TCRB data are described in detail in this peer-reviewed publication. We obtained the FASTQ files from the European Nucleotide Archive and directly from the TCRB website and uploaded them into DNAnexus. To process the data, we used the Sentieon TNseq FASTQ to VCF app on DNAnexus. The pipelines produced VCF files of somatic variants called in the exome capture and targeted sequencing data. Alignment, duplicate marking, indel realignment, base-quality score recalibration, indel co-realignment and variant calling were performed with the DNAnexus workflows shown below.

Somatic variant calling on the targeted sequencing sample took only 37 minutes, while somatic variant calling from the paired whole-exome sequenced samples took 10 hours and 18 minutes. Variant calling in the 60x tumor and the 30x normal TCRB whole-genome sequence sample took 16 hours and 45 minutes.

Using Sentieon’s TNseq FASTQ to VCF app, we were able to identify six and ten somatic mosaic mutations in the targeted sequence data, 381 and 739 somatic mosaic mutations in the whole-exome sequence data, and 5,812 and 2,171 somatic mutations in the whole-genome sequence data with TNsnv and TNhaplotyper, respectively. The number of non-silent mutations was 72 and 251 from the whole-exome sequence data as determined by TNsnv and TNhaplotyper, respectively. Other whole-exome sequencing studies of similar cancers have shown that 82 non-silent somatic mutations occur on average in these cancers. This is consistent with our findings given the differences in somatic variant calling pipelines used. Coverage of the targeted regions in the tumor sample exceeded 6,500x at some positions. MuTect2, with default settings, would have discarded more than 92% of the data at these sites, possibly resulting in false-negative or false-positive variant calls.
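The 92% figure follows from the default downsampling caps quoted earlier in this post. Here is a minimal sketch of the arithmetic, assuming downsampling keeps exactly the cap's worth of reads at a site, which is a simplification of what the tools actually do. The function name `fraction_discarded` is ours, for illustration only.

```python
def fraction_discarded(depth: float, cap: float) -> float:
    """Fraction of reads dropped at a site when coverage is capped at `cap`.

    Simplifying assumption: downsampling keeps exactly `cap` reads.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    # Sites at or below the cap keep every read.
    return max(0.0, 1.0 - cap / depth)


# Default caps quoted in this post: MuTect 1,000x, MuTect2 500x.
for tool, cap in (("MuTect", 1000), ("MuTect2", 500)):
    share = fraction_discarded(6500, cap)
    print(f"{tool}: {share:.1%} of reads discarded at 6,500x")
```

At 6,500x, a 500x cap keeps 500 of every 6,500 reads, so a little over 92% of the data at that site never reaches the variant caller.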

Digging deeper into the whole-exome data, we find p.R248Q and p.R158H somatic mutations in the gene TP53 encoding the protein p53, perhaps the best-studied tumor suppressor protein. p53 functions as a key cell cycle regulator, stopping cell growth in the case of genomic damage. In many cancers p53 is inactivated as a method of bypassing cellular senescence. Germline inactivating mutations in TP53 cause a cancer predisposition syndrome called Li-Fraumeni syndrome. Somatic inactivating mutations in TP53 are expected in this sample; TP53 is the most frequent target of somatic mutation in esophageal squamous cell carcinomas and about half of all cancers have TP53 mutations. If you are interested in learning more about our analysis, you can see our workflow in the “Sentieon FASTQ to VCF (Tumor/Normal)” project on DNAnexus.

With the secure and collaborative cloud-based DNAnexus Platform and Sentieon TNseq FASTQ to VCF app, researchers and clinicians can rapidly and confidently identify somatic mutations from thousands of paired samples. Register here for a license-free trial of Sentieon tools on DNAnexus, available through April 7th.

Case Study: Trio Analysis with Sentieon Rapid DNAseq on DNAnexus

Editor’s Note: This blog post is written by Don Freed, Bioinformatics Scientist at Sentieon. Email him at 


At Sentieon we work hard to create the most efficient, accurate and robust tools for variant calling. Thanks to our partnership with DNAnexus, we are sharing the benefits of this hard work with you. Through April 7th, we are offering license-free access to Sentieon pipelines on DNAnexus. Request access today to see how, using Sentieon DNAseq, you can obtain results identical to GATK at a fraction of the cost. In addition, Sentieon’s variant calling is deterministic; given identical input data, Sentieon will always call the same set of variants. Utilizing Sentieon’s tools on the DNAnexus Platform, clinicians and researchers can perform accurate and cost-effective analysis of petabyte-scale datasets with ease, seamlessly running analyses of an arbitrary number of samples simultaneously in the cloud.

Many of our customers use Sentieon tools to call variants from human samples. The typical human genome contains some 4.5 million variants relative to the human reference genome. While almost all of these variants are inherited, every individual has approximately 50 de novo variants, which occur uniquely in their genome. De novo variants are some of the most interesting genetic variants to study: they frequently cause rare sporadic diseases such as KBG syndrome, and have been implicated in complex disorders such as autism.

In this post, we’ll demonstrate the power of running Sentieon tools on DNAnexus by performing alignment with BWA, duplicate removal, base-quality score recalibration, indel realignment, haplotype-based variant calling and joint genotyping of a 30x whole-genome trio. Using these data, it is possible to identify de novo variants, determine the parental origin of some interesting inherited mutations, and examine the carrier status of this individual for rare recessive mutations. With the Rapid DNAseq app on DNAnexus, processing an entire trio takes about an hour. Whether you have a cohort of three or 3,000, the power of the DNAnexus Platform and the scalability of the cloud let you process a cohort of any size incredibly fast.

Running analyses on DNAnexus

For this trio analysis, we used data from the Illumina Platinum Genomes dataset for individuals NA12878, NA12891, and NA12892 downsampled to 30x. The original FASTQ files can be found at the European Nucleotide Archive. To process the data, we used the Sentieon Rapid DNAseq app on DNAnexus. We called variants in GVCF mode and input the GVCF files into the Sentieon GVCFtyper, resulting in a single multi-sample VCF file for the entire trio. We easily accomplished this by using the DNAnexus workflow shown below.

In total the analysis took just 73 minutes.

We performed the same analysis with the original 50x dataset in one hour and 46 minutes. Runtimes scale approximately linearly with the input coverage.
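As a quick sanity check on that scaling, the two runtimes reported in this post can be compared directly. This is only a back-of-the-envelope sketch: two data points cannot establish a trend, and the variable names are ours.

```python
# Coverage (x) -> wall-clock minutes, taken from the two runs described in this post.
runs = {30: 73, 50: 60 + 46}

coverage_ratio = 50 / 30             # how much more data the 50x run processed
runtime_ratio = runs[50] / runs[30]  # how much longer the 50x run took

# Minutes spent per 1x of coverage in each run.
minutes_per_x = {cov: mins / cov for cov, mins in runs.items()}

print(f"coverage ratio: {coverage_ratio:.2f}")
print(f"runtime ratio:  {runtime_ratio:.2f}")
for cov, rate in minutes_per_x.items():
    print(f"{cov}x: {rate:.2f} minutes per 1x of coverage")
```

The runtime grows somewhat more slowly than the coverage between these two runs. That pattern would fit a fixed per-run overhead on top of a roughly linear per-read cost, though that explanation is our guess rather than something we measured.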

We identified 2,458 de novo mutations in NA12878, well above the expected 50, although this increase has been previously attributed to primary cell somatic mutations or mutations introduced during immortalization and subsequent passage of the sequenced cells. We can see that NA12878 is heterozygous for both rs2472297 and rs6968865, which have been associated with increased coffee consumption.
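For readers who want to try a similar count on their own trio, here is a minimal sketch of one simple way to flag candidate de novo sites in a multi-sample VCF: the child carries a non-reference allele while both parents are called homozygous reference. This is an illustrative filter, not necessarily the procedure used for the count above. The function names, column indices and toy records are all hypothetical, and a real analysis would also filter on depth and genotype quality.

```python
def genotype(sample_field: str) -> tuple:
    """Return allele indices from a VCF sample field such as '0/1:35:99'."""
    gt = sample_field.split(":")[0]  # GT is the first FORMAT key when present
    return tuple(gt.replace("|", "/").split("/"))


def is_candidate_de_novo(child: str, father: str, mother: str) -> bool:
    """True if the child has an alt allele and both parents are hom-ref."""
    c, f, m = genotype(child), genotype(father), genotype(mother)
    if "." in c + f + m:  # skip sites where any genotype is missing
        return False
    parents_hom_ref = all(allele == "0" for allele in f + m)
    child_has_alt = any(allele != "0" for allele in c)
    return parents_hom_ref and child_has_alt


def candidate_de_novos(vcf_lines, child_col, father_col, mother_col):
    """Yield (chrom, pos, ref, alt) for each candidate de novo record."""
    for line in vcf_lines:
        if line.startswith("#"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if is_candidate_de_novo(
            fields[child_col], fields[father_col], fields[mother_col]
        ):
            yield fields[0], int(fields[1]), fields[3], fields[4]


# Toy records (made up for illustration); sample columns start at index 9.
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tchild\tfather\tmother",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t0/0:28\t0/0:31",
    "chr1\t2000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/1:30\t0/1:28\t0/0:31",
    "chr2\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:30\t./.:0\t0/0:31",
]
hits = list(candidate_de_novos(toy_vcf, child_col=9, father_col=10, mother_col=11))
print(hits)
```

Only the first toy record passes: the second is inherited from the father, and the third is skipped because the father's genotype is missing.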

Utilizing the DNAnexus cloud-based platform and Sentieon tools, our Rapid DNAseq and joint genotyping runtimes easily scale to thousands of samples. You can view everything we ran in this public project: Rapid trio genotyping.

Register here for a free trial of the Rapid DNAseq tool.

Innovation Fueled by Collaboration and Regulatory Science

In mid-2015 the Food and Drug Administration’s (FDA) Office of Health Informatics awarded DNAnexus a research and development contract to build precisionFDA, an online, cloud-based platform for sharing genomic information. Since its launch, more than 2,000 members of the next-generation sequencing (NGS) community have contributed to this resource by sharing and comparing biomedical data, software tools, and testing methodologies.

It falls under the responsibility of the FDA to ensure new medical treatments and tests meet a high standard for safety and efficacy, while working to get advances to market as quickly as possible. Following the announcement of President Obama’s Precision Medicine Initiative, the genomics community saw an increase in the use of NGS-based technologies in diagnostics, yet no standardized way to evaluate the accuracy of those tests. If new diagnostics were to be developed based on the broad applications of NGS, the approaches needed to be understood, and proved reliable, before they could be applied in clinical contexts.

The FDA took a forward-thinking approach to the regulation of genomic-based technologies and sponsored the development of the precisionFDA platform to, in the words of FDA leaders, “foster innovation and develop regulatory science around NGS tests,” and accelerate the implementation of precision medicine. Instead of government regulators establishing and imposing a set of performance standards for NGS tests with the typical top-down approach, precisionFDA seeks to empower the genomics community to develop regulatory science, through a collaborative and secure online platform.

The collaborative nature of precisionFDA lets researchers perform analyses on the same datasets, compare approaches, figure out what is successful, and determine where refinements can be made. The platform provides a flexible environment for test developers to leverage the findings from these collaborations to evaluate the accuracy and reproducibility of NGS analysis workflows, and share those results with the FDA and the rest of the community. The power of this approach is that the FDA remains at the epicenter of ongoing discussions, enabling the community to continue innovating, while keeping a pulse on the rapidly evolving genomics research space.

Robert Califf, former FDA Commissioner, penned an op-ed piece on his way out of office: How The FDA Will Help Lead the Next Medical Revolution. Califf believes that with precisionFDA, the agency can simultaneously meet the goals of protecting patients and advancing genomic medicine. Regulatory oversight can often be seen as a hindrance to innovation in healthcare, but the former commissioner believes that with this novel approach to regulation, the FDA will play a big role in realizing the potential of basing an individual’s treatment plan on their unique characteristics and genetic profile.

PrecisionFDA was founded upon the principles of collaboration and creating networks of stakeholders from industry, academia, and government. This platform is a successful example of how innovative regulation can spur progress by giving the key community stakeholders the ability to work together to define regulatory science.

In recent years, improvements in NGS technology have enhanced our ability to interrogate the human genome with high specificity and bring those insights together with clinical patient data, which has pushed us closer to delivering on the promise of precision medicine. In order to keep pace with these technological advancements, it is crucial to harness the network effect of scientific collaboration. By empowering the community members with regulatory input, innovation can be stimulated instead of suppressed, and these innovations in turn will improve the quality of genomic tests and lead to advancements in health outcomes for patients.

George Asimenos, VP at DNAnexus, will be presenting on precisionFDA at the Molecular Med Tri-Conference in San Francisco as part of the Best Practice in Personalized and Translational Medicine short course. Hear the presentation Monday, February 20th, from 8am to 11am.

Learn more and get involved at