DNAnexus & TCGA: Reanalyzing the World’s Largest Pan-Cancer Initiative Dataset

The Cancer Genome Atlas (TCGA), a joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), was established in 2006 to create a detailed catalog of genetic mutations responsible for cancer using next-generation sequencing. Over the years, TCGA collaborators have generated over 2.5 petabytes of data collected from nearly 11,000 patients, describing 34 different tumor types (including 10 rare cancers) based on paired tumor and normal tissue sets.

TCGA has been an incredible learning process for the genomics community. At the start of the initiative, for example, researchers knew far less about mutation calling. Over the past decade, however, the sensitivity of variant callers has improved, and a much greater number of samples is now available for assessing genomic markers.

“In addition, mutation calling for TCGA samples was primarily done for individual tumor types, with projects using different mutation callers or different versions of the callers, meaning the data wasn’t uniform,” said Carolyn Hutter, PhD, Program Director, Division of Genomic Medicine at the NHGRI. “We now believe the best way to do analysis is to have a uniform set of calls generated by multiple mutation callers, with quality control and filtering, across multiple cancer types. That’s why the TCGA team decided to go back and recall the over 10,000 exomes in TCGA and produce this multi-caller somatic mutation dataset.”

Recalling mutations across the entire TCGA dataset was a massive undertaking. The compute resources needed for a project of this scale were not in place at TCGA member institutes. The DNAnexus Platform met the key requirements of the mutation calling project: security for patient data, a scalable environment that could handle tens of thousands of exomes, and reproducibility of results. Over a four-week period, approximately 1.8 million core-hours of computational time were used to process 400 TB of data.
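To give a rough sense of that scale, the figures above can be turned into an average level of parallelism. The short Python sketch below is only a back-of-the-envelope calculation: it assumes the four weeks were exactly 28 days and that the load was spread evenly, which a real run almost certainly was not. The variable names are illustrative and are not part of any DNAnexus tooling.

```python
# Back-of-the-envelope scale check using only the figures quoted in the text.
CORE_HOURS = 1_800_000   # approximate total compute used
DATA_TB = 400            # data processed, in terabytes
WEEKS = 4                # duration of the run (assumed to be exactly 28 days)

wall_clock_hours = WEEKS * 7 * 24

# Assumption: load spread evenly over the whole period (a simplification).
avg_concurrent_cores = CORE_HOURS / wall_clock_hours
core_hours_per_tb = CORE_HOURS / DATA_TB

print(f"Wall-clock hours: {wall_clock_hours}")
print(f"Average cores busy around the clock: {avg_concurrent_cores:,.0f}")
print(f"Core-hours per TB processed: {core_hours_per_tb:,.0f}")
```

Under these assumptions, the project kept roughly 2,700 cores busy continuously for a month, or about 4,500 core-hours per terabyte of data.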

“Realigning TCGA data with a single methodology across new standardized mutation callers will make the tumor data much more relevant to the community. The DNAnexus Platform allowed us to create a uniform and analytical treatment through version-controlled analyses and tools that would have been challenging to replicate at any single facility in a reasonable time frame,” said David Wheeler, PhD, Professor, Department of Molecular and Human Genetics at Baylor College of Medicine. “With this standardized set of mutation calls obtained by several callers, we’ll be able to identify genetic alterations contributing to cancer that are shared between tumors independent of the tissue-of-origin. We are optimistic that having access to such information will spur advancement in precision medicine.”

Key TCGA results to date include:

  • Improved understanding of the genomic underpinnings of cancer
  • Reclassification of cancer by identifying tumor subtypes with distinct sets of genomic alterations
  • Insights into treatment approaches, whether based on currently available therapies or used to guide drug development

Reanalyzing the data under a single methodology with new standardized mutation callers allows samples to be compared across cancer types. This will facilitate new findings, such as whether one individual’s breast cancer shows greater genomic similarity to a subtype of ovarian cancer than to other types of breast cancer. In the future, we believe patients will be treated based on their genomic profile rather than the origin of their cancer. DNAnexus is proud to collaborate with TCGA in making this important dataset more useful to the cancer research community.

Researchers now have access to the TCGA pipelines via the DNAnexus Platform as well as through a GitHub repository. DNAnexus works to ensure that its mechanisms for handling data access requests and vending data to approved requestors meet the security standards for dbGaP and TCGA data in the cloud.

Review the latest NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy, in which the NIH cites the DNAnexus Compliance White Paper.