New UK Biobank Exome Pipeline: A 1-Button Read-Mapping Protocol for Public Sequencing Datasets

Public genomic resources such as UK Biobank are invaluable to researchers around the world. But the size and complexity of these datasets, which often span hundreds of thousands of samples and petabytes of data, present a barrier to many users, even when the data are openly accessible to bona fide researchers.

We have been proud to work closely with UK Biobank, Regeneron Genetics Center (RGC), Amazon Web Services, and other partners to provide tools that enable scientists to tap into these enormous banks of genomic and phenotypic data, such as DNAnexus Apollo and the cloud-based UK Biobank Research Analysis Platform.

In collaboration with the RGC, we have now delved even deeper to help alleviate the challenges of harmonizing sequencing data across large projects without sacrificing flexibility. At such large scales, even small differences in read mapping can manifest in downstream analyses, and while reprocessing is inevitable in the long run, it is best (and cheapest) to get it right the first time.

So, we set out to make consistent and lossless read mapping as hands-off as possible.

The result? The OQFE App.

The Original Quality Functionally Equivalent App is a one-button implementation of the OQFE read-mapping protocol, which was used for the recent 200,000 WES UK Biobank release. Users only need to provide a CRAM or FASTQ file containing Illumina short reads, and the OQFE App will generate a CRAM meeting all OQFE specifications, including:

  • Deterministic alt-aware mapping to the GRCh38 reference genome
  • Duplicate-marked primary and supplementary alignments
  • Lossless read and base-quality compression (i.e., fully recoverable FASTQs)
  • Forward compatibility with quality-score recalibration and binning schemes
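
To make "fully recoverable FASTQs" concrete, here is a minimal Python sketch of how one aligned read can be turned back into its original FASTQ record. It relies only on the SAM convention that reverse-strand alignments (FLAG bit 0x10) store the sequence reverse-complemented and the qualities reversed. The function name and the sample read are made up for illustration, and this is not code from the OQFE App. The round trip only works when the full sequence and the original qualities are stored, with no hard clipping and no binning.

```python
# Illustrative only: field names follow the SAM format, not the OQFE App's code.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is stored reverse-complemented


def to_fastq(name: str, flag: int, seq: str, qual: str) -> str:
    """Return the 4-line FASTQ record for one primary alignment."""
    if flag & REVERSE_STRAND:
        # Undo the mapper's orientation change to get the read as sequenced.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


# A made-up read as it came off the sequencer ...
original_record = "@read1\nAACCGT\n+\nFFF:,#\n"
# ... and the same read as stored after mapping to the reverse strand.
recovered = to_fastq("read1", REVERSE_STRAND, "ACGGTT", "#,:FFF")

print(recovered == original_record)
```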

The OQFE App isn’t like most bioinformatics tools, with loads of options and tunable parameters. Because the goal is data harmonization, the OQFE App hard-codes all software versions, reference files and runtime parameters. (Of course, the OQFE App is open-source and built from open-source software, so feel free to take it apart and see what makes it tick!) To understand how we landed on this protocol, here’s a bit of history.

The OQFE protocol is an extension of the functionally equivalent (FE) pipeline created in 2017, which used reference-based compression (CRAM) to harmonize more than 400,000 whole-genome samples’ worth of data across multiple large-scale NIH sequencing projects, with a three-fold reduction in size. The FE pipeline specifies an exact version of the GRCh38 reference genome, parameters to ensure deterministic alt-aware mapping, a duplicate-marking strategy that ensures supplementary alignments are appropriately marked, and a set of consistent CRAM tags.

The OQFE protocol retains all these features with updated versions of the constituent programs to address some minor bugs (see the OQFE paper for details). Unlike the FE protocol, however, the OQFE protocol does not implement any base-quality recalibration or quality-score binning: all original quality scores are retained. This advance is made possible by the general shift of short-read sequencing to the NovaSeq platform, which has natively four-valued quality scores (i.e., they are already well binned). OQFE CRAMs are thus both backward compatible (you can remake the original FASTQs) and forward compatible (you can directly call variants, or run quality-score recalibration and apply your own binning strategy).
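
A small Python sketch can show why binning is lossy and why it adds little when scores are already four-valued. The bin boundaries and both quality strings below are invented for illustration. They are not the FE pipeline's bins or real NovaSeq output.

```python
def decode_phred33(qual: str) -> list[int]:
    """Convert a FASTQ quality string (Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in qual]


# Hypothetical bins as (low, high, representative); not any tool's real scheme.
EXAMPLE_BINS = [(0, 9, 5), (10, 19, 15), (20, 29, 25), (30, 93, 35)]


def bin_scores(scores: list[int]) -> list[int]:
    """Replace each score with the representative value of its bin."""
    return [
        next(rep for low, high, rep in EXAMPLE_BINS if low <= q <= high)
        for q in scores
    ]


fine_grained = decode_phred33("IH?5+#@F")  # made-up read, many distinct scores
four_valued = decode_phred33("FFF:,#FF")   # made-up read, four distinct scores

for label, scores in (("fine-grained", fine_grained), ("four-valued", four_valued)):
    before = len(set(scores))
    after = len(set(bin_scores(scores)))
    print(f"{label}: {before} distinct scores before binning, {after} after")
```

Binning maps many scores onto one, so the originals cannot be recovered from the binned values. When the input already takes only four values, binning would bring little further reduction, which is why keeping the original scores costs little.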

All UK Biobank exome CRAMs are processed by the OQFE protocol, and the OQFE App provides multiple features that make it easy for researchers to harmonize their own data with the UKB exomes.

  1. Start with a FASTQ: If you have raw reads, the OQFE App will create all the required CRAM tags; you just provide a sample name.
  2. Start with a CRAM: If your reads are already mapped or reference-compressed, the OQFE App will name-sort your CRAM, roll back to a FASTQ, and OQFE-map your reads. This option also lets you keep your existing RG string, so it is a great way to add custom information to your OQFE CRAMs.
  3. Pick your pixels: Specify how optical duplicates are marked. Different flowcell geometries call for different optical-duplicate pixel distances. But don’t worry, pixel distance doesn’t affect the CRAMs (the same reads get marked as duplicates no matter what pixel distance you select), just how the duplicate reads are categorized in the duplicate summary output.
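
The pixel-distance point can be sketched in a few lines of Python. This is a simplified illustration, not the duplicate-marking code the OQFE App runs: the `Read` class, the coordinates, and the rule "same tile and within the pixel distance in both x and y" are assumptions made for the example, and 100 and 2500 are just two sample distances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    name: str
    tile: int  # flowcell tile the cluster was imaged on
    x: int     # cluster x coordinate, in pixels
    y: int     # cluster y coordinate, in pixels


def count_optical(representative: Read, duplicates: list[Read], pixel_distance: int) -> int:
    """Count duplicates close enough to the kept read to be called optical."""
    return sum(
        1
        for dup in duplicates
        if dup.tile == representative.tile
        and abs(dup.x - representative.x) <= pixel_distance
        and abs(dup.y - representative.y) <= pixel_distance
    )


# One duplicate set: the kept read plus three reads already marked as duplicates.
rep = Read("kept", 1101, 1000, 2000)
dups = [
    Read("a", 1101, 1050, 2040),  # neighbouring cluster on the same tile
    Read("b", 1101, 2500, 3000),  # same tile, farther away
    Read("c", 1102, 1000, 2000),  # different tile, never optical
]

for distance in (100, 2500):
    optical = count_optical(rep, dups, distance)
    print(f"distance={distance}: {len(dups)} duplicates, {optical} optical")
```

The number of duplicates is the same for both distances. Only the share reported as optical changes, which matches the behaviour described in item 3.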

The OQFE App is especially easy for our users to execute on the DNAnexus Platform. A DNAnexus user simply needs to add input files and hit run to execute the exact methods autonomously, without the need for a specialized environment. And since UK Biobank files will be available on the UK Biobank Research Analysis Platform, there will be no need to download data or transfer files to local environments.

But harmonization only works if everyone can do it, so you don’t need to be on DNAnexus to run OQFE. To make OQFE accessible to all, the OQFE App is open-source software, free for anyone to download as a single containerized OQFE pipeline with all required source and validation files that can be executed on any local or cloud infrastructure that supports Docker.

Interested in using the OQFE protocol? You can find the workflow publicly available in the DNAnexus App Library or through Docker Hub.

St. Jude Cloud Provides Model For Cancer Collaboration

When researching a rare disease with many subtypes driven by diverse and distinct genetic alterations, data sharing is key. Samples acquired by a single institute, a single research initiative, or even a single nation may lack sufficient power for genomic discovery and clinical correlative analysis, and the mass of raw data from whole genome sequencing presents challenges. 

Which is why we were proud to partner with St. Jude Children’s Research Hospital and Microsoft to create a solution: a cloud-based, data-sharing ecosystem that has proved to be a model for harmonized genetic data and collaboration across the pediatric cancer community. 

Since the initial announcement of the partnership in 2018, more than 1.25 petabytes of data have been incorporated into the St. Jude Cloud, including:

  • 12,104 whole genomes;
  • 7,697 whole exomes; and
  • 2,202 transcriptomes

from more than 10,000 pediatric cancer patients and long-term survivors, and 800 pediatric sickle cell patients.

As reported recently in the journal Cancer Discovery, this makes it the largest publicly available genomic data resource for pediatric cancer, and it has already helped advance research. 

For example, Camille Keenan and colleagues gained new insight into a rare C11orf95 fusion in ependymoma by uploading and analyzing their RNA-Seq samples using the RNA Classification workflow on St. Jude Cloud. The Cancer Discovery paper includes additional use cases that classify 135 pediatric cancer subtypes by gene expression profiling and map mutational signatures across 35 pediatric cancer subtypes. 

How does the St. Jude Cloud work? Raw and curated genomic data, analysis and visualization tools are structured into three inter-connected apps: 

  • Genomics Platform, for accessing data and analysis workflows; 
  • PeCan, the Pediatric Cancer Knowledgebase, for exploring curated data on more than 5,000 pediatric cancer genomes; and 
  • Visualization Community, for exploring published pediatric cancer genomic or epigenomic landscape maps, and for visualizing user data using ProteinPaint or GenomePaint. 

Common use cases, such as assessing the recurrence of a rare genomic variant or the expression status of a gene of interest, are built into these apps, eliminating the need to download data and perform custom analyses. To enable researchers with little to no formal computational training to perform sophisticated genomic analysis, we also developed eight end-to-end analysis workflows designed with a point-and-click interface for uploading input files and graphically visualizing the results. 

“Effective sharing of genomic data and a community effort to elucidate etiology are…critical to developing effective therapeutic strategies,” the Cancer Discovery paper authors wrote. “The complementarity amongst the three apps within the St. Jude Cloud ecosystem enables the optimal use of computational resources so that researchers can focus on innovative analyses leading to new insights.”

The project leverages Microsoft Azure data storage and our open and flexible DNAnexus Portals™ workspace to create a secure environment compliant with all of the major data privacy standards (HIPAA, CLIA, CGP, 21 CFR Parts 22, 58, 493, and European data privacy laws and regulations). 

As the paper authors note, St. Jude Cloud currently hosts genomic data generated primarily by St. Jude studies, but they envision it will serve as a collaborative research platform for the broader pediatric cancer community in the future. 

“User-uploaded data can be analyzed and explored alongside the wealth of curated and raw pediatric genomic data on St. Jude Cloud, and deposition of user data into St. Jude Cloud requires minimal effort. In this regard St. Jude Cloud represents a community resource, framework, and significant contribution to the pediatric genomic sequencing data sharing landscape.” 

2020 Vision: What Have We Learned?

The year 2020 certainly didn’t go as planned. But it was educational. We all had a crash course in infectious disease biology, trial-by-fire lessons in virtual meeting etiquette, and eye-opening lessons in the difficulties of homeschooling and trying to maintain a work-life balance within the same few walls for days, weeks, and months at a time.

At DNAnexus, we also learned a few things — about our customers and their evolving needs, about our capacity to pivot and adapt to meet those needs, and about our ability to be even more creative than ever in delivering novel solutions to this new norm. Here are some of our top lessons.

1. Easy, reliable access is key. One of the biggest challenges people have faced while working from home is accessing the data, applications, and compute resources they need. With the majority of employees suddenly using the VPN every day, network bottlenecks and slow download times can quickly become a chronic problem. And in many cases, some data and resources may simply not be accessible to users who aren’t on-premises. Luckily, the DNAnexus Platform and the Cloud Workstation App proved to be lifesavers, enabling secure, fluid, reliable collaboration and sharing with partners and peers around the globe. 

2. You love to learn. We developed a suite of free bioinformatics courses available to users of varying experience levels, and many of you were eager to brush up on your skills, or learn new ones. The online curriculum, which was selected with input from our customers, was so popular that we will be continuing it into the new year. We’re especially excited about the first session of the new year, Demystifying AI & ML in Biomedical Research, which will be held January 28. Add it to your calendars now!

3. Going virtual means going global. Although we missed the chance to spend some time connecting with our customers and the science/technology community in person, we were excited for the opportunity to expose our research to the wider world, as most conferences went virtual. The American Society of Human Genetics (ASHG), for instance, featured a host of great science from our own team as well as many of our customers and research partners. We also got to flex our creative muscles designing some fun virtual ‘booths’ and resource pages.

4. Opportunity sometimes knocks on the door of disaster. Watching the fevered pace of early research into SARS-CoV-2, David Fenstermacher, our Vice President of Precision Medicine and Data Services, wondered: Could COVID power a new era of precision medicine, moving it beyond oncology? Infectious disease seems a strong candidate for a precision medicine approach, he argued, due to the high variability between patients, and being able to link genetic profiles to clinical outcomes would be extremely useful when developing diagnostics and formulating treatment plans. By stratifying patients based on genetic information, healthcare providers and government decision-makers could adopt more rational and effective surveillance, containment, and treatment strategies.

5. Science doesn’t stop. Nor does innovation. In fact, we may have gotten even more innovative in order to keep scientific discoveries coming apace. Scrolling through our blog, you can find story after story about improvements to our platform, new applications, and examples of some of the amazing research it is enabling. We were proud to be recognized as a top workplace for innovators, and we look forward to continuing to carry out our mission of revolutionizing the use of genomic and other omic information in healthcare in 2021.