DNAnexus & TCGA: Reanalyzing the World’s Largest Pan-Cancer Initiative Dataset

The Cancer Genome Atlas (TCGA), a joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), was established in 2006 to create a detailed catalog of the genetic mutations responsible for cancer using next-generation sequencing. Over the years, TCGA collaborators have generated more than 2.5 petabytes of data from nearly 11,000 patients, describing 34 different tumor types (including 10 rare cancers) based on paired tumor and normal tissue sets.

TCGA has been an incredible learning process for the genomics community. At the start of the initiative, for example, researchers knew far less about mutation calling. Over the past decade, however, the sensitivity of variant callers has improved, and a much larger number of samples is now available for assessing genomic markers.

“In addition, mutation calling for TCGA samples was primarily done for individual tumor types, with projects using different mutation callers or different versions of the callers, meaning the data wasn’t uniform,” said Carolyn Hutter, PhD, Program Director, Division of Genomic Medicine at the NHGRI. “We now believe the best way to do analysis is to have a uniform set of calls generated by multiple mutation callers, with quality control and filtering, across multiple cancer types. That’s why the TCGA team decided to go back and recall the over 10,000 exomes in TCGA and produce this multi-caller somatic mutation dataset.”
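The idea of combining calls from several mutation callers can be illustrated with a simple voting scheme. The sketch below is only an illustration, not the TCGA pipeline: the caller names, the example variants, and the `min_callers` threshold are all hypothetical, and the real project also applied quality control and filtering steps that are not shown here.

```python
from collections import Counter

# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]


def consensus_calls(
    calls_by_caller: dict[str, set[Variant]], min_callers: int = 2
) -> set[Variant]:
    """Return the variants reported by at least `min_callers` distinct callers."""
    support: Counter = Counter()
    for variants in calls_by_caller.values():
        # Each caller contributes at most one vote per variant (its calls are a set).
        support.update(variants)
    return {variant for variant, votes in support.items() if votes >= min_callers}


# Hypothetical output of three callers run on the same tumor/normal pair.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}

print(sorted(consensus_calls(calls)))
```

Raising `min_callers` trades sensitivity for specificity: requiring all three callers to agree keeps only the chr1 variant in this toy example.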

Re-calling mutations across the TCGA dataset was a massive undertaking. The compute resources needed for a project of this scale were not in place at TCGA member institutes. The DNAnexus Platform met the key requirements of the mutation calling project: security for patient data, a scalable environment that could handle tens of thousands of exomes, and reproducibility of results. Over a four-week period, approximately 1.8 million core-hours of computational time were used to process 400 TB of data, yielding reproducible results.
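To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above and assumes, for simplicity, that the work was spread evenly over four full weeks, which real workloads rarely are.

```python
CORE_HOURS = 1_800_000   # approximate compute used, from the figures above
DATA_TB = 400            # data processed, in terabytes
WEEKS = 4                # duration of the run

wall_clock_hours = WEEKS * 7 * 24                # hours in four weeks
average_cores = CORE_HOURS / wall_clock_hours    # cores busy on average, if load were even
core_hours_per_tb = CORE_HOURS / DATA_TB         # compute spent per terabyte processed

print(f"Wall-clock hours: {wall_clock_hours}")
print(f"Average concurrent cores: {average_cores:.0f}")
print(f"Core-hours per TB: {core_hours_per_tb:.0f}")
```

In other words, the run kept roughly 2,700 cores busy around the clock for a month, at about 4,500 core-hours per terabyte.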

“Realigning TCGA data with a single methodology across new standardized mutation callers will make the tumor data much more relevant to the community. The DNAnexus Platform allowed us to create a uniform and analytical treatment through version-controlled analyses and tools that would have been challenging to replicate at any single facility in a reasonable time frame,” said David Wheeler, PhD, Professor, Department of Molecular and Human Genetics at Baylor College of Medicine. “With this standardized set of mutation calls obtained by several callers, we’ll be able to identify genetic alterations contributing to cancer that are shared between tumors independent of the tissue-of-origin. We are optimistic that having access to such information will spur advancement in precision medicine.”

Key TCGA results to date have been:

  • Improved understanding of the genomic underpinnings of cancer
  • Reclassification of cancer by identifying tumor subtypes with distinct sets of genomic alterations
  • Insights into treatment approaches, whether based on currently available therapies or used to guide drug development

The value of reanalyzing the data under a single methodology, across new standardized mutation callers, is that samples can be compared across cancer types. This will facilitate new findings, such as whether one individual’s breast cancer shows greater genomic similarity to a subtype of ovarian cancer than to other types of breast cancer. In the future, we believe patients will be treated based on their genomic profile rather than the origin of their cancer. DNAnexus is proud to collaborate with TCGA in making this important dataset more useful to the cancer research community.

Researchers now have access to the TCGA pipelines via the DNAnexus Platform in addition to a GitHub repository. DNAnexus works to ensure that the mechanisms for handling data access requests and vending data to approved requestors meet the security standards for dbGaP and TCGA data in the cloud.

Review the latest NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy, in which the NIH cited the DNAnexus Compliance White Paper.

The U.S. Cancer Moonshot and a Culture of Collaboration

Yesterday, United States Vice President Joe Biden hosted the National Cancer Moonshot Summit. Scientists, oncologists, donors, and patients convened for a daylong conference intended to pick up the pace of research towards curing cancer. Rather than focusing on one specific type of cancer, the conference broadly discussed more than 100 types of cancer, emphasizing strategies for prevention, early detection, wide access to treatment, and collaboration among researchers. You can read a first-hand account from our CMO, David Shaywitz, here.

As part of the Moonshot effort, DNAnexus, in partnership with PatientCrossroads, has committed to developing the Integrated Data Engagement Analytics (IDEA) platform to facilitate the collection, analysis, and sharing of genetic, proteomic, and EMR/phenotypic data to accelerate disease research. PatientCrossroads and DNAnexus are currently engaged in a pioneering effort to help patients obtain their raw genetic files and medical records, and then to integrate these data with patient-reported outcomes data collected by PatientCrossroads. The IDEA platform will hold this information in a secure and compliant environment where authorized researchers can access it and use it to develop novel insights. You can review the complete list of public and private sector Cancer Moonshot commitments announced in the White House press release.

Here at DNAnexus, we are particularly devoted to reducing the technical barriers to accessing and working with research datasets. We believe that a culture of openness in genomic research will lead to greater medical breakthroughs. Most data sharing in cancer genomics research has been centralized through rich, yet controlled-access, databases like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), both of which properly approved researchers can easily access on the DNAnexus Platform. Read more about some of the collaborative genomics community initiatives DNAnexus is a part of: precisionFDA, the open access cancer genomics pilot, and ICGC.

 

Once in a Blue Moon Competition: precisionFDA Truth Challenge

The FDA, the Global Alliance for Genomics and Health (GA4GH), and the National Institute of Standards and Technology (NIST) recently teamed up to create a once-in-a-blue-moon challenge for genomic scientists! In the challenge, dubbed the precisionFDA Truth Challenge, genomic innovators were invited to test their informatics pipelines on two datasets: the well-characterized Genome in a Bottle (GiaB) reference sample NA12878 (HG001), and a new reference sample, HG002, for which the truth data had not yet been released.

PrecisionFDA is an online, cloud-based, virtual research space where members of the genomics community can experiment, share data and tools, collaborate, and define standards for evaluating analytical pipelines. Community members span academia, industry, healthcare organizations, and government. All of these organizations are working together to further innovation and develop regulatory science around NGS tests. The community currently includes more than 1,500 users across 600 organizations, with more than 10 terabytes of genetic data stored.

This is the second challenge issued through precisionFDA, following the precisionFDA Consistency Challenge. The Truth Challenge is about measuring the consistency and accuracy of informatics pipelines when they analyze a human sample whose truth data is unknown to participants. NIST and GiaB released the truth data on May 26, 2016, after the close of the challenge.
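To make "accuracy" concrete, here is a minimal sketch of how a pipeline's calls might be scored against a truth set using precision, recall, and F-score. This is an illustration only: the variants are made up, and real comparisons (including whatever method the challenge itself used, which is not described here) also have to deal with variant normalization and confident regions, which exact tuple matching ignores.

```python
def evaluate(calls: set, truth: set) -> dict[str, float]:
    """Score a set of called variants against a truth set by exact matching."""
    true_positives = len(calls & truth)    # called and in the truth set
    false_positives = len(calls - truth)   # called but not in the truth set
    false_negatives = len(truth - calls)   # in the truth set but missed

    called = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Hypothetical truth set and pipeline output: (chromosome, position, ref, alt).
truth = {("1", 100, "A", "T"), ("1", 250, "G", "C"), ("2", 75, "C", "G"), ("X", 900, "T", "A")}
calls = {("1", 100, "A", "T"), ("1", 250, "G", "C"), ("2", 75, "C", "G"), ("3", 40, "G", "A")}

metrics = evaluate(calls, truth)
print(metrics)
```

Here the pipeline finds three of the four true variants and reports one false positive, so precision and recall both come out at 0.75.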

What makes this challenge so exciting?

In 2014, NIST, in collaboration with GiaB and the FDA, released NA12878, the first gold-standard whole human reference genome. Since then, it has arguably become one of the most studied biospecimens. Researchers from around the world use NA12878 as training data for assessing pipeline performance.

Many pipelines use some form of machine learning to decide whether a reported mutation is real, so the difficulty is ensuring that a pipeline doesn’t overfit the training data. A pipeline can be tuned to maximize performance on the training dataset, and if the test data happens to be similar to the training data, the pipeline will appear more consistent and accurate than it really is. A great resource on why scientists split data into train and test roles to assess the accuracy, reliability, and credibility of their predictive models (the algorithms that go into a pipeline) can be found here.
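The overfitting risk can be shown with a toy example. The sketch below tunes a single quality-score cutoff to maximize F-score on one sample, then applies the same cutoff to a second, held-out sample. All of the scores and labels are invented for illustration, and a real pipeline has far more tunable parameters than one threshold.

```python
# Each candidate call is (quality score, whether it is in the truth set).
Call = tuple[float, bool]


def f_score(calls: list[Call], threshold: float) -> float:
    """F-score of the calls kept at `threshold`, relative to all true calls in `calls`."""
    kept = [is_true for quality, is_true in calls if quality >= threshold]
    true_positives = sum(kept)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(kept)
    recall = true_positives / sum(is_true for _, is_true in calls)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(train: list[Call]) -> float:
    """Pick the quality cutoff that maximizes F-score on the training sample."""
    candidates = sorted({quality for quality, _ in train})
    return max(candidates, key=lambda t: f_score(train, t))


# Hypothetical samples: low-quality calls are false in `train` but true in `test`.
train = [(10, False), (20, False), (30, True), (40, True), (50, True)]
test = [(10, True), (20, True), (30, False), (40, True), (50, True)]

threshold = tune_threshold(train)
train_f = f_score(train, threshold)
test_f = f_score(test, threshold)
print(f"threshold={threshold}, train F={train_f:.2f}, test F={test_f:.2f}")
```

The cutoff that looks perfect on the training sample scores much lower on the held-out one, which is exactly the gap that an independent test sample exposes.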

To test how pipelines perform under real-life conditions, scientists needed a second reference sample, with an associated truth callset, on which NGS pipelines have not been trained. This is exactly what NIST and GiaB have provided in reference sample HG002.

Scientists can now evaluate algorithms using test data that is separate from the training data, a separation that is broadly accepted as fundamental to sound evaluation methodology. Moreover, unlike NA12878, the new reference sample HG002 is male. This poses new challenges for algorithms, since there is only one copy of the X chromosome, and brings a new opportunity to evaluate NGS methods along this dimension.
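One practical consequence of a single X chromosome is that heterozygous calls on most of chrX are unexpected in a male sample. The sketch below flags such calls. The function names are hypothetical, and the pseudoautosomal region (PAR) coordinates are assumed GRCh37 values that should be verified against the reference build you actually use.

```python
# Assumed GRCh37 pseudoautosomal regions on chrX (1-based, inclusive).
# Within these regions a male carries two copies (one on X, one on Y).
PAR_X_GRCH37 = [(60_001, 2_699_520), (154_931_044, 155_260_560)]


def expected_copies(chrom: str, pos: int, sex: str) -> int:
    """Expected number of copies of a locus for the given sex (autosomes, X, Y only)."""
    if chrom == "Y":
        return 1 if sex == "male" else 0
    if chrom == "X" and sex == "male":
        in_par = any(start <= pos <= end for start, end in PAR_X_GRCH37)
        return 2 if in_par else 1
    return 2  # autosomes, and chrX in a female sample


def is_suspicious_het(chrom: str, pos: int, genotype: tuple[int, ...], sex: str) -> bool:
    """True if a heterozygous genotype is called where only one copy is expected."""
    is_het = len(set(genotype)) > 1
    return is_het and expected_copies(chrom, pos, sex) == 1


# A heterozygous call (0/1) in the non-PAR part of chrX.
print(is_suspicious_het("X", 5_000_000, (0, 1), "male"))    # flagged in a male sample
print(is_suspicious_het("X", 5_000_000, (0, 1), "female"))  # plausible in a female sample
```

A pipeline tuned only on a female sample such as NA12878 never has to get this case right, which is one reason a male reference sample is a useful test.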

The winners

As the clock struck midnight EST on May 25, 2016, the precisionFDA Truth Challenge closed with 36 entries from 21 teams spanning 5 countries, truly an international competition of epic proportions!

The winners of the Truth Challenge will be announced at the upcoming Festival of Genomics in Boston on June 29th at 8:45am EST by Elizabeth Mansfield, PhD, Deputy Director for Personalized Medicine in FDA’s Center for Devices and Radiological Health’s Office of In Vitro Diagnostics and Radiological Health. Registration is free. Want to learn more about precisionFDA? Stop by the DNAnexus booth (#240) during the Festival to receive a demo of the precisionFDA platform from a member of the precisionFDA team.

Want to recreate the Truth Challenge for yourself? Join the precisionFDA community today and evaluate a pipeline of your choice against HG002. Happy testing!