Case Study: Calling Somatic Variants with Sentieon on DNAnexus

 Editor’s Note: This blog post was written by Don Freed, Bioinformatics Scientist at Sentieon. Email him at

Sentieon on DNAnexus provides clinicians and researchers an easy-to-use interface to process genomics data with state-of-the-art tools in the cloud. In our previous post, we showed how the Sentieon Rapid DNAseq app is capable of processing an entire trio from FASTQ to VCF in about an hour. Today we’ll demonstrate the processing of targeted and exome-capture sequencing of tumor-normal paired samples.

Cancer is a genomic disease, arising as the result of sequential mutations in somatic cells. These mutations change the properties of cancer cells, leading to uncontrolled growth, cell death resistance, and eventually tumor formation and metastasis. Clinicians and researchers are very interested in using next-generation sequencing (NGS) data to dig into the genetic changes that give rise to cancer. NGS data allows researchers to better understand the process of cancer and tumor formation and allows clinicians to determine the best course of treatment for cancer patients. As a result, the sequencing of paired tumor-normal samples is a common study design.

Somatic variant calling is a difficult task due to tumor heterogeneity and low abundance of tumor cells. Two of the more popular tools to address this task are MuTect and MuTect2, developed by the Broad Institute. High coverage across the protein-coding sequences of the tumor genome is essential for the sensitive detection of low-frequency somatic mosaic mutations. In targeted sequencing experiments, coverage of the targeted regions can exceed 1,000 or even 10,000 reads. Processing such high-coverage datasets is computationally intensive, and by default MuTect (1.1.5) and MuTect2 (3.7) downsample the data to 1,000x and 500x, respectively, for improved runtime performance. Unfortunately, downsampling may also result in missed variant calls.
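To make the cost of downsampling concrete, here is a minimal sketch of how read depth affects the chance of observing a low-frequency variant. It assumes a simple binomial model in which each read independently carries the variant with probability equal to the variant allele fraction. The allele fraction (0.5%), the minimum number of supporting reads (5), and the function name are illustrative assumptions, not parameters taken from MuTect, MuTect2, or TNseq.

```python
from math import comb


def prob_at_least(min_alt_reads, depth, allele_fraction):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, allele_fraction).

    Simplified model: every read is an independent draw and sequencing
    errors are ignored. Real callers use far richer statistical models.
    """
    below = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - below


# Hypothetical variant present in 0.5% of reads, requiring 5 supporting reads.
for depth in (500, 1000, 5000):
    p = prob_at_least(5, depth, 0.005)
    print(f"depth {depth:>5}x: P(at least 5 variant reads) = {p:.3f}")
```

Under these assumptions, the probability of collecting enough supporting reads rises sharply with depth, which is why capping depth can cost sensitivity for low-frequency variants.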

The Sentieon TNseq app on DNAnexus does not perform downsampling, enabling you to achieve the highest sensitivity for low-frequency variants present in your sample. TNseq is also deterministic: given the same input data, the same variants will be called. In addition, the app produces results that match MuTect and MuTect2, but with a drastically improved runtime and cost savings.

Running TNseq Tumor-Normal Analysis on DNAnexus

For this case study, we used publicly available tumor-normal paired samples from SRP079186 and whole-genome sequence data from the Texas Cancer Research Biobank (TCRB). According to the project description in SRA, the tumor samples in SRP079186 consist of sequence data from small cell carcinoma mixed with squamous cell carcinoma of the esophagus, while the TCRB data are described in detail in this peer-reviewed publication. We obtained the FASTQ files from the European Nucleotide Archive and directly from the TCRB website and uploaded them into DNAnexus. To process the data, we used the Sentieon TNseq FASTQ to VCF app on DNAnexus. The pipelines produced VCF files of somatic variants called in the exome capture and targeted sequencing data. Alignment, duplicate marking, indel realignment, base-quality score recalibration, indel co-realignment and variant calling were performed with the DNAnexus workflows shown below.

Somatic variant calling on the targeted sequencing sample took only 37 minutes, while somatic variant calling from the paired whole-exome sequenced samples took 10 hours and 18 minutes. Variant calling in the 60x tumor and the 30x normal TCRB whole-genome sequence samples took 16 hours and 45 minutes.

Using Sentieon’s TNseq FASTQ to VCF app, we were able to identify six and ten somatic mosaic mutations in the targeted sequence data, 381 and 739 somatic mosaic mutations in the whole-exome sequence data, and 5,812 and 2,171 somatic mutations in the whole-genome sequence data with TNsnv and TNhaplotyper, respectively. The number of non-silent mutations was 72 and 251 from the whole-exome sequence data as determined by TNsnv and TNhaplotyper, respectively. Other whole-exome sequencing studies of similar cancers have shown that 82 non-silent somatic mutations occur on average in these cancers. This is consistent with our findings given the differences in somatic variant calling pipelines used. Coverage of the targeted regions in the tumor sample exceeded 6,500x at some positions. MuTect2, with default settings, would have discarded more than 92% of the data at these sites, possibly resulting in false-negative or false-positive variant calls.
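The 92% figure follows from simple arithmetic, sketched below. The sketch assumes that downsampling keeps exactly the cap number of reads at a site and discards the rest; the actual downsampling behavior of the tools may differ in detail. The function name is illustrative.

```python
def fraction_discarded(depth, cap):
    """Fraction of reads dropped at a site when depth is capped at `cap`.

    Assumes exactly `cap` reads are kept whenever depth exceeds the cap.
    """
    if depth <= cap:
        return 0.0
    return 1.0 - cap / depth


# Default caps quoted in this post: MuTect 1,000x and MuTect2 500x.
for tool, cap in (("MuTect", 1000), ("MuTect2", 500)):
    frac = fraction_discarded(6500, cap)
    print(f"{tool}: {frac:.1%} of reads discarded at 6,500x")
```

At 6,500x, a 500x cap keeps 500 of 6,500 reads, so roughly 92.3% are discarded; a 1,000x cap would discard roughly 84.6%.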

Digging deeper into the whole-exome data, we find p.R248Q and p.R158H somatic mutations in the gene TP53 encoding the protein p53, perhaps the best studied tumor suppressor protein. p53 functions as a key cell cycle regulator, stopping cell growth in the case of genomic damage. In many cancers p53 is inactivated as a method of bypassing cellular senescence. Germline inactivating mutations in TP53 cause a cancer predisposition syndrome called Li-Fraumeni syndrome. Somatic inactivating mutations in TP53 are expected in this sample; TP53 is the most frequent target of somatic mutation in esophageal squamous cell carcinomas and about half of all cancers have TP53 mutations. If you are interested in learning more about our analysis, you can see our workflow in the “Sentieon FASTQ to VCF (Tumor/Normal)” project on DNAnexus.

With the secure and collaborative cloud-based DNAnexus Platform and Sentieon TNseq FASTQ to VCF app, researchers and clinicians can rapidly and confidently identify somatic mutations from thousands of paired samples. Register here for a license-free trial of Sentieon tools on DNAnexus, available through April 7th.

ACMG: A Look at Applying Genomic Data to Clinical Reports

The annual meeting of the American College of Medical Genetics and Genomics (ACMG) takes place this week (March 21-25, 2017) in Phoenix, Arizona, providing an outstanding forum to learn how genetics and genomics are being integrated into medical and clinical practice. Eric Venner, from the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine, will present the following poster (Abstract Number 368): Generating Clinical Reports from Genomic Data on the Cloud-based Neptune Platform, on Friday, March 24th, 10:30 AM-12:00 PM.

To meet the demand for timely and cost-efficient clinical reporting, HGSC developed Neptune, an automated analytical platform to sign out and deliver clinical reports. The process starts when a clinical site uploads a test requisition to the HIPAA-compliant environment on DNAnexus. Next, de-identified samples are analyzed with HGSC’s variant calling pipeline, Mercury, which feeds into the reporting pipeline, Neptune. Variants of putative clinical relevance are identified for manual review and possible addition to a VIP database of clinically relevant variation. The VIP database currently holds 20,872 SNPs and 3,946 indels, as well as a curated set of copy number variants.

Neptune’s manual review interface was designed with a clinical geneticist in mind. Users can log in, curate variants in their samples, update the VIP database accordingly, and create clinical reports. Early applications include reporting for the NIH Electronic Medical Records and Genomics (eMERGE) Network III, where more than 14,500 samples and a panel of 109 genes will be processed over the course of three years.

eMERGE is a national network that combines DNA biorepositories with electronic medical record (EMR) systems for large-scale, high-throughput genetic research that investigates how personalized treatments impact patient care. Research so far has led to significant discoveries across a wide range of diseases, including prostate cancer, leukemia, and diabetes. DNAnexus and HGSC worked together to build the eMERGE Commons, a data repository where genomic data are merged with patient EMRs, along with analysis results and bioinformatics tools that eMERGE researchers can access and apply.

Updated DNAnexus Impact Assessment for Cloudbleed: No evidence of exploitation.

As described in our February 27, 2017 blog post regarding the Cloudflare information leak (“Cloudbleed”), a bug within the code running on Cloudflare edge servers was discovered by a Google security researcher.

Upon further investigation into the use of Cloudflare on DNAnexus, we found on February 27th at 2:39 PM PST that, contrary to what we had indicated in our blog post, HTTP requests served by Cloudflare edge servers in some cases included session tokens with authentication information. We revoked all customer session tokens at 5:06 PM PST that same day, at which point all requests to DNAnexus required re-authentication. All existing tokens were unusable after this time.

On February 23rd Cloudflare provided their most recent update and stated that there was no evidence of exploitation; there have been no updates since that deviate from this information. Additionally, Cloudflare has completed analysis of edge server log data, and on March 3rd confirmed that was not found to have been impacted.

Our CDN usage design has been reviewed and we continue to believe no customer has been impacted by the incident. Any potential new exposure has been eliminated and there continues to be no evidence of exploitation.

We know how critical information security is to our customers so if you have any questions about your account, please do not hesitate to contact our customer support team at