DNAnexus & TCGA: Reanalyzing the World’s Largest Pan-Cancer Initiative Dataset

The Cancer Genome Atlas (TCGA), a joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), was established in 2006 to create a detailed catalog of genetic mutations responsible for cancer using next-generation sequencing.  Over the years, TCGA collaborators have generated over 2.5 petabytes of data collected from nearly 11,000 patients, describing 34 different tumor types (including 10 rare cancers) based on paired tumor and normal tissue sets.

TCGA has been an incredible learning process for the genomics community. At the start of this initiative, for example, researchers knew far less about mutation calling. Over the past decade, however, variant callers have become more sensitive, and a much greater number of samples is now available for assessing genomic markers.

“In addition, mutation calling for TCGA samples was primarily done for individual tumor types, with projects using different mutation callers or different versions of the callers, meaning the data wasn’t uniform,” said Carolyn Hutter, PhD, Program Director, Division of Genomic Medicine at the NHGRI. “We now believe the best way to do analysis is to have a uniform set of calls generated by multiple mutation callers, with quality control and filtering, across multiple cancer types. That’s why the TCGA team decided to go back and recall the over 10,000 exomes in TCGA and produce this multi-caller somatic mutation dataset.”
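The idea of a uniform, multi-caller call set can be illustrated with a minimal sketch. It assumes each caller's output has already been reduced to a set of (chromosome, position, ref, alt) tuples, and it keeps variants reported by at least two callers. The caller names, the toy variants, and the two-caller threshold are all hypothetical; the source does not describe the actual filtering rules used for TCGA.

```python
from collections import defaultdict

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]


def consensus_calls(
    calls_by_caller: dict[str, set[Variant]], min_callers: int = 2
) -> dict[Variant, list[str]]:
    """Keep variants reported by at least `min_callers` callers.

    Returns a mapping from each retained variant to the sorted list of
    callers that reported it. The threshold is an illustrative assumption.
    """
    support: dict[Variant, list[str]] = defaultdict(list)
    # Iterate callers in sorted order so the supporter lists are deterministic.
    for caller, variants in sorted(calls_by_caller.items()):
        for variant in variants:
            support[variant].append(caller)
    return {v: c for v, c in support.items() if len(c) >= min_callers}


# Toy input: made-up positions and placeholder caller names.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
}

kept = consensus_calls(calls)
for variant, callers in kept.items():
    print(variant, "supported by", callers)
```

Recording which callers support each variant, rather than only a yes/no flag, leaves room for downstream quality control to apply a stricter or looser threshold without rerunning the callers.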

Reanalyzing the TCGA dataset was a massive undertaking. The compute resources needed for a large-scale project of this nature were not in place at TCGA member institutes. The DNAnexus Platform met the key requirements of the mutation calling project: security for patient data, a scalable environment that could handle tens of thousands of exomes, and reproducibility of results. Over a four-week period, approximately 1.8 million core-hours of computational time were used to process 400 TB of data, yielding reproducible results.
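To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the four weeks were exactly 28 days and that the load was spread evenly over that time, which is a simplification; real usage was almost certainly bursty.

```python
# Figures quoted in the post.
CORE_HOURS = 1.8e6   # approximate total compute
DATA_TB = 400        # data processed
WEEKS = 4            # project duration

# Assumption: exactly 28 days of wall-clock time, evenly loaded.
wall_clock_hours = WEEKS * 7 * 24

# Average number of cores that would need to run continuously.
avg_concurrent_cores = CORE_HOURS / wall_clock_hours

# Compute spent per terabyte of input.
core_hours_per_tb = CORE_HOURS / DATA_TB

print(f"wall-clock hours:        {wall_clock_hours}")
print(f"average cores in use:    {avg_concurrent_cores:,.0f}")
print(f"core-hours per terabyte: {core_hours_per_tb:,.0f}")
```

Under these assumptions the project corresponds to roughly 2,700 cores running around the clock for four weeks, which helps explain why the work was hard to host at a single member institute.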

“Realigning TCGA data with a single methodology across new standardized mutation callers will make the tumor data much more relevant to the community. The DNAnexus Platform allowed us to create a uniform  and analytical treatment through version-controlled analyses and tools that would have been challenging to replicate at any single facility in a reasonable time frame,” said David Wheeler, PhD, Professor, Department of Molecular and Human Genetics at Baylor College of Medicine.  “With this standardized set of mutation calls obtained by several callers, we’ll be able to identify genetic alterations contributing to cancer that are shared between tumors independent of the tissue-of-origin. We are optimistic that having access to such information will spur advancement in precision medicine.”

Key TCGA results to date have been:

  • Improved understanding of the genomic underpinnings of cancer
  • Reclassification of cancer by identifying tumor subtypes with distinct sets of genomic alterations
  • Insights into treatment approaches, either based on currently available therapies or useful for guiding drug development

The value of this reanalysis is that applying a single methodology with new standardized mutation callers allows samples to be compared across cancer types. This will facilitate new findings, such as whether one individual’s breast cancer shows greater genomic similarity to a subtype of ovarian cancer than to other types of breast cancer. In the future, we believe patients will be treated based on their genomic profile rather than the origin of their cancer. DNAnexus is proud to collaborate with TCGA in making this important dataset more useful to the cancer research community.

Researchers now have access to the TCGA pipelines via the DNAnexus Platform as well as a GitHub repository. DNAnexus works to ensure that its mechanisms for handling data access requests and vending data to approved requestors meet security standards for dbGaP and TCGA data in the cloud.

Review the latest NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy, in which the NIH cites the DNAnexus Compliance White Paper.

Ten-Fold PacBio Sequel Call Set Proves Affordable & Effective in Identifying Structural Variants

Pacific Biosciences, a key partner of DNAnexus, has released the first public Sequel™ dataset of NA12878. This is a 10-fold coverage set featuring 32.8Gb of data, with an N50 read length of 11.8kb. Generated by PacBio’s new Sequel System, this dataset was used to demonstrate the robust ability of even low coverage long-read data to discover novel structural variants. The Sequel System is smaller, faster, and provides higher throughput, delivering around 7X the amount of data as the PacBio RS II.
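For readers less familiar with these metrics, the sketch below shows how an N50 read length and a fold-coverage figure are typically computed. The read lengths are toy values, and the haploid human genome size of about 3.1 Gb is an assumption not stated in the post.

```python
def n50(read_lengths):
    """Return the length L such that reads of length >= L contain
    at least half of all sequenced bases."""
    if not read_lengths:
        raise ValueError("need at least one read length")
    half_total = sum(read_lengths) / 2
    running = 0
    # Walk from the longest read down until half the bases are covered.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Toy read lengths (not real data): 32 bases in total, half is 16,
# and the two longest reads (8 + 8) already reach it.
toy_reads = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_reads))  # 8

# Assumed haploid human genome size; the post quotes 32.8 Gb of data.
ASSUMED_GENOME_SIZE = 3.1e9
coverage = fold_coverage(32.8e9, ASSUMED_GENOME_SIZE)
print(f"approximate coverage: {coverage:.1f}x")
```

With that assumed genome size, 32.8 Gb of sequence works out to a little over 10-fold coverage, in line with the figure quoted for the dataset.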

The Sequel System costs half as much as a PacBio RS II, is five times faster, and produces seven times as much data per SMRT Cell. We believe that with these improvements the Sequel System is poised to open up long-read sequencing to a broader audience. It will allow access to more robust applications ranging from genome and transcriptome assembly to variant detection. This is demonstrated by the Parliament Suite on DNAnexus, where combining low coverage PacBio reads with a short read dataset can significantly improve both the accuracy and the number of structural variant calls compared to short reads alone. In addition, 10-fold PacBio sequencing of NA12878 has been shown to recall 84% of known structural variants (SVs) and, using the SV tool PBHoney, to identify thousands more not previously seen in short reads.

Aside from structural variation, PacBio long reads have been used for robust, high quality de novo genome and transcriptome assemblies. Additionally, with instrument and chemistry improvements made for the Sequel System, the cost for generating a 50-fold coverage human dataset for resequencing and de novo assembly is expected to decrease dramatically.

Despite falling sequencing costs, de novo genome assembly and structural variant calling remain complex tasks that can require massive computational resources to weave long reads into a final, polished assembly or to run structural variation detection methods across multiple data types. For this reason, PacBio has selected DNAnexus to be its cloud bioinformatics partner, providing bioinformatics support to its global customers. The SMRT® Analysis Suite v3.1.1, optimized for the cloud environment, is available on the DNAnexus Platform alongside other long read analysis tools such as PBHoney, PBJelly, and Parliament.

Curious about PacBio tools and services on DNAnexus? Schedule a 30-minute scientific consultation.


De novo assemblies of individual human genomes via the PacBio RS II at high-fold coverage have revealed tens of thousands of structural variants, many of which are accessible only through SMRT Sequencing. In an effort to optimize SV discovery methods, PacBio set out to understand what SVs could be identified in the well-studied human sample NA12878 from low-fold coverage sequencing on the new Sequel System. To create the NA12878 Sequel dataset, PacBio generated approximately 10-fold coverage of the sample on the Sequel System, at a sequencing cost of about $5,000. The newly generated long reads were mapped to the GRCh37 human reference using NGM-LR, and structural variants were called with PBHoney.

The output calls were compared to a “truth set” created by merging the 1000 Genomes Project and Genome in a Bottle NA12878 call sets, both of which were generated using short read technology at much higher coverage. The low coverage 10-fold Sequel System set recalled 86% of truth set deletions and 81% of truth set insertions. Beyond that, the 10-fold Sequel call set identified thousands of insertions and deletions not found in the short read truth set, with over 66% of these novel structural variants verified using a FALCON-Unzip 60-fold PacBio RS II de novo assembly.
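A comparison of this kind can be sketched as follows. The example treats each deletion as a genomic interval and counts a truth-set event as recalled when some call overlaps it reciprocally by at least 50%. That matching rule is a common convention assumed here for illustration, not necessarily the one PacBio used, and the intervals are toy values. Insertions, which have no length on the reference, would need a different matching rule (for example, breakpoint distance).

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two fractions of each interval covered by the other."""
    len_a, len_b = a.end - a.start, b.end - b.start
    if a.chrom != b.chrom or len_a <= 0 or len_b <= 0:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / len_a, overlap / len_b)


def matches(sv: SV, others: list[SV], min_overlap: float) -> bool:
    return any(reciprocal_overlap(sv, o) >= min_overlap for o in others)


def recall(truth: list[SV], calls: list[SV], min_overlap: float = 0.5) -> float:
    """Fraction of truth-set events matched by at least one call."""
    if not truth:
        return 0.0
    return sum(matches(t, calls, min_overlap) for t in truth) / len(truth)


def novel_calls(truth: list[SV], calls: list[SV], min_overlap: float = 0.5) -> list[SV]:
    """Calls that match nothing in the truth set."""
    return [c for c in calls if not matches(c, truth, min_overlap)]


# Toy deletions (hypothetical coordinates, not from NA12878).
truth = [SV("chr1", 1000, 2000), SV("chr1", 5000, 5600), SV("chr2", 300, 900)]
calls = [SV("chr1", 1100, 2050), SV("chr2", 310, 880), SV("chr3", 100, 400)]

print(f"recall: {recall(truth, calls):.0%}")
print("novel:", novel_calls(truth, calls))
```

Calls reported as novel by a comparison like this are candidates only; as the post notes, they still need independent verification, for instance against a high-coverage de novo assembly.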

This dataset demonstrates that low coverage Sequel reads can be used for accurate variant calling as well as novel structural variant identification, all of which is now available at a fraction of the cost with the new Sequel System. Through the partnership with DNAnexus, you can recreate the analysis performed on the NA12878 10-fold set. These tools and the generated dataset can be found on DNAnexus under Featured Projects.

Questions? Contact us directly at: pacbio@dnanexus.com.

Leading Genome Research Center Migrates to DNAnexus on Azure

Today we announced that the trusted DNAnexus genome informatics and data management platform is now also available on Microsoft Azure, Microsoft’s open, flexible, enterprise-grade cloud computing platform. Leveraging Azure, DNAnexus provides organizations a single, secure, scalable, and collaborative platform to accelerate the application of genomics within healthcare and research. The Stanford Center for Genomics and Personalized Medicine (SCGPM) is the first organization to access DNAnexus on Azure.

A key advantage to conducting genomic research in the cloud is the enhanced collaboration facilitated by data accessibility, consistency, and scalability. SCGPM researchers already have collaborations on the DNAnexus Platform hosted by Amazon Web Services; adopting DNAnexus on Azure as well means that they can collaborate even more widely. By leveraging the powerful data-handling capabilities of DNAnexus on Azure, a distributed network of scientists and researchers has secure access to terabytes of data through a common user interface.

DNAnexus and Microsoft are both valued partners to Stanford’s core sequencing facility. SCGPM and David Heckerman, distinguished scientist and director of Microsoft Genomics, have been in close collaboration for years. By extending the DNAnexus Platform to Azure, it is now easier for SCGPM researchers to work closely with David’s team. We believe we are just seeing the tip of the iceberg in terms of the potential for medical discovery.

DNAnexus is proud to support SCGPM on its mission to translate genomics into patient-centered medicine, and we look forward to enabling the discoveries that unfold.

Innovation Through Collaboration

Through additional partnerships, Microsoft recently developed computational methods to accelerate the best practices pipeline for genome resequencing sevenfold. By improving the efficiency of the Burrows-Wheeler Aligner (BWA) and Genome Analysis Toolkit (GATK), researchers and medical professionals are able to get actionable results in just four hours, compared to the previous twenty-eight. This is critical for medical professionals to accelerate diagnosis and treatment for patients.

Genomic sequencing and analysis have become key components of the diagnosis and treatment of cancer and other genetic conditions. This effort has both relied on and stimulated innovative technologies. At DNAnexus, we firmly believe that in order to continue innovating and further break down the technical barriers to disease, community collaboration is essential. The sharing of data and ideas between organizations – and even industries – spurs the innovation critical to medical breakthroughs. Microsoft is a global leader in technological innovation, and by partnering with leading research centers, universities, and the private sector, it is poised to make great contributions to the genomics revolution.

The DNAnexus Platform sits at the forefront of cloud-based data security, compliance, and controlled access. By co-developing with DNAnexus, Microsoft will be able to deploy its tools into an investigative environment while leveraging extensive research experience. We are excited to be collaborating with Microsoft and to make these cutting-edge bioinformatics tools available to the genomics community via the DNAnexus Platform in the future.

Facilitating Collaboration on DNAnexus

The need for enhanced collaboration is a trend in the genomics industry we have been following for a while. DNAnexus equips end-users with out-of-the-box clinical compliance and streamlines communication between healthcare providers, reducing information silos for more efficient collaboration.

However, this notion of partnership goes deeper than groups of scientists working together to parse through datasets. Innovation and exploration are best served through collaboration, so successful innovation in the genomics industry also relies on disparate industries working together towards a common goal. By tapping into the genomics network, members of the community are able to learn from each other to advance research, leading to faster medical advances and tailored patient care.

DNAnexus is excited about the opportunity to partner with Microsoft, given their commitment to advancing the field of genomics, and their depth and breadth of experience offering solutions to the healthcare industry.