Collaborative Genomics: Highlights From 2016

This has been a remarkable year for DNAnexus and the genomics industry at large. As 2016 comes to a close, we celebrate the year’s accomplishments, and everyone who contributed to its successes.

The past 12 months were jam-packed with innovative research, strategic partnerships, platform enhancements, productive meetups, and plenty of conferences. Here are some of our favorite highlights from 2016!

  • We are humbled to have contributed to a wide range of collaborative efforts to advance precision medicine. We kicked off the year by powering a collaborative breast cancer study by the National Comprehensive Cancer Network. Researchers from 14 different institutions worked together to identify potential therapies for a patient with metastatic triple-negative breast cancer, an unprecedented level of collaboration made possible by the DNAnexus Platform.
  • We also celebrated the one-year anniversary of President Obama’s Precision Medicine Initiative (PMI) with the first PMI Summit. This set in motion much of our year to come, particularly our work delivering the precisionFDA platform, an important component of the President’s PMI. PrecisionFDA, a Bio-IT World Best Practices award-winning program, was designed to leverage community participation to advance regulatory science for next-generation sequencing assay evaluation. 

Highlights of precisionFDA’s initiatives this year included three challenges with the goal of engaging and sharing data to improve DNA test results. Community members had the opportunity to test pipelines, discuss best practices, and add software to the precisionFDA platform. These challenges were wildly successful, and garnered a great amount of community involvement.

  • We would be remiss not to mention the powerful insights coming out of the Geisinger and Regeneron collaboration, powered by DNAnexus. De-identified EMR data from Geisinger’s MyCode Community Health Initiative is integrated with whole exome sequencing data from these same patients. The power of this joint approach was apparent in a paper published in the New England Journal of Medicine, revealing a genetic variant that appears to result in reduced levels of triglycerides and a lower risk of coronary artery disease. As the partnership continually adds more patients with associated EHR data and sequenced exomes, the power of these studies will only increase. Learn more about the program in this Mendelspod podcast.
  • To further move the needle on cancer research, Singapore-based POLARIS (Personalized OMIC Lattice for Advancing Research and Improving Stratification) began using DNAnexus to enable a series of genomic tests for cancer. These include gastrointestinal and solid tumor cancer tests, which are part of a systematic effort to develop a framework for omics-based tests within Singapore.
  • M2Gen also adopted the DNAnexus Platform to support data analysis and collaboration for the Oncology Research Information Exchange Network (ORIEN) Avatar Research Program. This innovative program joins academic cancer centers and pharmaceutical companies in their efforts to study and treat cancer through the development of more precise treatments for patients.
  • In addition to powering POLARIS and ORIEN, DNAnexus reanalyzed The Cancer Genome Atlas (TCGA) dataset. TCGA is a joint effort between the National Cancer Institute and National Human Genome Research Institute, and includes data from 10,487 patients across 33 cancer types. This reanalysis project was a massive undertaking, whereby during a four-week period, approximately 1.8 million core-hours of computational time were used to process 400 TB of data — a testament to the scalability of the DNAnexus Platform.
  • A number of partnerships were announced to further build out DNAnexus as a seamless end-to-end solution for genome analysis. The integration of the DNAnexus Platform with Sapio Science’s Exemplar Next Generation Sequencing Laboratory Information Management System (LIMS) enables laboratory management and informatics solutions in the cloud. We also partnered with SolveBio to provide access to their curated data analysis services, offering a rapid and secure progression from data analysis through interpretation. Finally, Genomics plc entered into a collaborative effort with DNAnexus to break down the barriers for population-scale sequencing analysis.
  • Together with PacBio, we worked to simplify structural variant discovery and decrease barriers to de novo assembly. As PacBio’s cloud bioinformatics partner, we are able to support researchers working with long-read sequencing data. Through this effort, the SMRT Analysis Suite v3.1.1 has been optimized for the cloud environment and is available on the DNAnexus Platform. Other long-read analysis tools, such as PBHoney, PBJelly, and Parliament, are also optimized for use on the platform.

A particularly exciting development in the realm of long-read sequencing was the release of the first public Sequel dataset of NA12878, demonstrating that you don’t need expensive high-fold coverage to discover novel structural variants. The Sequel System is faster, costs half as much, and provides higher throughput, delivering around 7 times as much data as the PacBio RS II. These added benefits will hopefully make long-read sequencing available to a broader audience.

  • Finally, from a corporate perspective, DNAnexus also had some serious wins. Our team  nearly doubled in 2016 to keep up with the ever-increasing activity from our customers, and to keep pace with the burgeoning genomics industry. We couldn’t be more fired up about the growth of our team!

Special thanks to all our customers, partners, and collaborators for contributing to another amazing year filled with exciting milestones. We’re delighted by the developments within the genomics industry, and look forward to 2017 with excitement and inspiration.

What is your favorite memory from 2016? Let us know on Twitter. #Genomics16

DNAnexus & TCGA: Reanalyzing the World’s Largest Pan-Cancer Initiative Dataset

The Cancer Genome Atlas (TCGA), a joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), was established in 2006 to create a detailed catalog of genetic mutations responsible for cancer using next-generation sequencing.  Over the years, TCGA collaborators have generated over 2.5 petabytes of data collected from nearly 11,000 patients, describing 34 different tumor types (including 10 rare cancers) based on paired tumor and normal tissue sets.

TCGA has been an incredible learning process for the genomics community. At the start of this initiative, for example, researchers didn’t know as much about mutation calling. Over the past decade, however, we’ve improved the sensitivity of variant callers and have a much greater number of samples with which to assess genomic markers.

“In addition, mutation calling for TCGA samples was primarily done for individual tumor types, with projects using different mutation callers or different versions of the callers, meaning the data wasn’t uniform,” said Carolyn Hutter, PhD, Program Director, Division of Genomic Medicine at the NHGRI. “We now believe the best way to do analysis is to have a uniform set of calls generated by multiple mutation callers, with quality control and filtering, across multiple cancer types. That’s why the TCGA team decided to go back and recall the over 10,000 exomes in TCGA and produce this multi-caller somatic mutation dataset.”

Reanalyzing the TCGA dataset was a massive undertaking. The compute resources needed for a large-scale project of this nature were not in place at TCGA member institutes. The DNAnexus Platform met important requirements for the mutation calling project, including patient data security, a scalable environment that could handle tens of thousands of exomes, and reproducibility of results. Over a four-week period, approximately 1.8 million core-hours of computational time were used to process 400 TB of data, yielding reproducible results.
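
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the four weeks were 28 days of continuous wall-clock time and that the load was spread evenly, neither of which is stated above, so the results are only an order-of-magnitude illustration.

```python
# Figures quoted in the post.
CORE_HOURS = 1.8e6   # approximate core-hours consumed
WEEKS = 4            # duration of the reanalysis
DATA_TB = 400        # data processed, in terabytes

# Assumption: 4 weeks = 28 days of uninterrupted wall-clock time.
wall_clock_hours = WEEKS * 7 * 24

# Assumption: load spread evenly, so this is an average, not a peak.
avg_concurrent_cores = CORE_HOURS / wall_clock_hours

# Average compute spent per terabyte of input.
core_hours_per_tb = CORE_HOURS / DATA_TB

print(f"Wall-clock hours:         {wall_clock_hours}")
print(f"Average concurrent cores: {avg_concurrent_cores:,.0f}")
print(f"Core-hours per TB:        {core_hours_per_tb:,.0f}")
```

Under these assumptions the project kept a few thousand cores busy around the clock for the whole month, which helps explain why it was impractical at a single member institute.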

“Realigning TCGA data with a single methodology across new standardized mutation callers will make the tumor data much more relevant to the community. The DNAnexus Platform allowed us to create a uniform  and analytical treatment through version-controlled analyses and tools that would have been challenging to replicate at any single facility in a reasonable time frame,” said David Wheeler, PhD, Professor, Department of Molecular and Human Genetics at Baylor College of Medicine.  “With this standardized set of mutation calls obtained by several callers, we’ll be able to identify genetic alterations contributing to cancer that are shared between tumors independent of the tissue-of-origin. We are optimistic that having access to such information will spur advancement in precision medicine.”

Key TCGA results to date have been:

  • Improved understanding of the genomic underpinnings of cancer
  • Reclassification of cancer by identifying tumor subtypes with distinct sets of genomic alterations
  • Insights into treatment approaches, whether based on currently available therapies or used to help with drug development

Reanalyzing the samples under a single methodology with new standardized mutation callers allows them to be compared across cancer types. This will facilitate new findings, such as whether one individual’s breast cancer shows greater genomic similarity to a subtype of ovarian cancer than to other types of breast cancer. In the future, we believe patients will be treated based on their genomic profile rather than the origin of their cancer. DNAnexus is proud to collaborate with TCGA in making this important dataset more useful to the cancer research community.

Researchers now have access to the TCGA pipelines via the DNAnexus Platform in addition to a GitHub repository. DNAnexus works to ensure that mechanisms for data access requests and for vending data to approved requestors meet security standards for dbGaP and TCGA data in the cloud.

Review the latest NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy, where the NIH cited the DNAnexus Compliance White Paper.

Ten-Fold PacBio Sequel Call Set Proves Affordable & Effective in Identifying Structural Variants

Pacific Biosciences, a key partner of DNAnexus, has released the first public Sequel™ dataset of NA12878. This is a 10-fold coverage set featuring 32.8 Gb of data, with an N50 read length of 11.8 kb. Generated by PacBio’s new Sequel System, this dataset was used to demonstrate the robust ability of even low-coverage long-read data to discover novel structural variants. The Sequel System is smaller, faster, and provides higher throughput, delivering around 7X as much data as the PacBio RS II.
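
For readers less familiar with the two summary statistics quoted above, the following Python sketch shows how fold coverage and N50 read length are conventionally computed. The genome size of roughly 3.1 Gb and the toy read lengths are illustrative assumptions, not values from the PacBio dataset.

```python
def fold_coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome length."""
    return total_bases / genome_size


def n50(read_lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("read_lengths must not be empty")


# Assumption: a human genome size of about 3.1 Gb (not stated in the post).
ASSUMED_GENOME_SIZE = 3.1e9
coverage = fold_coverage(32.8e9, ASSUMED_GENOME_SIZE)
print(f"Approximate coverage: {coverage:.1f}-fold")

# Toy read lengths (hypothetical) to show the N50 calculation.
toy_reads = [2, 3, 4, 5, 6]
print(f"Toy N50: {n50(toy_reads)}")
```

With the assumed genome size, 32.8 Gb works out to a little over 10-fold, consistent with the "10-fold" label used for the dataset.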

The Sequel System is half the cost of a PacBio RS II, five times faster, and produces seven times as much data per SMRT Cell. We believe that with these improvements the Sequel System is poised to open up long-read sequencing to a broader audience. It will allow access to more robust applications ranging from genome and transcriptome assembly to variant detection. This is demonstrated by the Parliament Suite on DNAnexus, where combining low-coverage PacBio reads with a short-read dataset can significantly improve both the accuracy and the number of structural variant calls compared to short reads alone. In addition, 10-fold PacBio sequencing of NA12878 has been shown to recall 84% of known structural variants (SVs) and to identify thousands more not previously seen in short reads, using the SV tool PBHoney.

Aside from structural variation, PacBio long reads have been used for robust, high quality de novo genome and transcriptome assemblies. Additionally, with instrument and chemistry improvements made for the Sequel System, the cost for generating a 50-fold coverage human dataset for resequencing and de novo assembly is expected to decrease dramatically.

Despite dropping sequencing costs, de novo genome assembly and structural variant calling remain complex tasks that can require massive computational resources to weave long reads into a final, polished assembly or to run structural variation detection methods across multiple data types. For this reason, PacBio has selected DNAnexus to be its cloud bioinformatics partner, providing bioinformatics support to its global customers. The SMRT® Analysis Suite v3.1.1 is available on the DNAnexus Platform and has been optimized for the cloud environment, as have other long-read analysis tools such as PBHoney, PBJelly, and Parliament.

Curious about PacBio tools and services on DNAnexus? Schedule a 30-minute scientific consultation.


De novo assemblies of individual human genomes via the PacBio RS II at high-fold coverage have revealed tens of thousands of structural variants, many of which are accessible only through SMRT Sequencing. In an effort to optimize SV discovery methods, PacBio set out to understand what SVs could be identified in the well-studied human sample NA12878 from low-fold coverage sequencing on the new Sequel System. To create the NA12878 Sequel dataset, PacBio generated approximately 10-fold coverage of the NA12878 sample on the Sequel System, which comes to about $5,000 in sequencing cost. The newly generated long reads were mapped to the GRCh37 human reference using NGM-LR, and structural variants were called with PBHoney.

The output calls were compared to a “truth set” generated as a merged set between the 1000 Genomes Project and Genome in a Bottle NA12878 sets, both of which were analyzed using short-read technology at much higher coverage. The low-coverage PacBio 10-fold Sequel System set recalled 86% of truth set deletions and 81% of truth set insertions. Beyond that, the 10-fold Sequel call set identified thousands of insertions and deletions not found in the short-read truth set, with over 66% of these novel structural variants verified using a FALCON-Unzip 60-fold PacBio RS II de novo assembly.
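
The comparison described above can be illustrated with a minimal Python sketch. The matching rule used here (same chromosome, same SV type, breakpoints within 500 bp) and all of the toy records are assumptions made for illustration. The post does not say how calls were actually matched against the truth set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int
    svtype: str  # "DEL" or "INS"


def matches(a, b, max_dist=500):
    """Hypothetical rule: same chromosome and type, positions within max_dist bp."""
    return (a.chrom == b.chrom and a.svtype == b.svtype
            and abs(a.pos - b.pos) <= max_dist)


def recall(truth, calls, svtype, max_dist=500):
    """Fraction of truth-set SVs of one type recovered by at least one call."""
    truth_of_type = [t for t in truth if t.svtype == svtype]
    if not truth_of_type:
        return 0.0
    found = sum(
        1 for t in truth_of_type
        if any(matches(t, c, max_dist) for c in calls)
    )
    return found / len(truth_of_type)


def novel_calls(truth, calls, max_dist=500):
    """Calls with no counterpart in the truth set."""
    return [c for c in calls
            if not any(matches(t, c, max_dist) for t in truth)]


# Toy data (hypothetical), not taken from the NA12878 call sets.
truth = [SV("chr1", 10_000, "DEL"), SV("chr1", 50_000, "DEL"),
         SV("chr2", 7_000, "INS")]
calls = [SV("chr1", 10_120, "DEL"), SV("chr2", 7_300, "INS"),
         SV("chr3", 900, "INS")]

del_recall = recall(truth, calls, "DEL")
ins_recall = recall(truth, calls, "INS")
novel = novel_calls(truth, calls)
print(f"Deletion recall: {del_recall:.0%}, insertion recall: {ins_recall:.0%}")
print(f"Novel calls: {novel}")
```

Calls that land in the novel list are not necessarily false positives, which is why the post describes checking them against an independent de novo assembly.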

This dataset demonstrates that low coverage Sequel reads can be used for accurate variant calling as well as novel structural variant identification, all of which is now available at a fraction of the cost with the new Sequel System. Through the partnership with DNAnexus, you can recreate the analysis performed on the NA12878 10-fold set. These tools and the generated dataset can be found on DNAnexus under Featured Projects.

Questions? Contact us directly at: