Public genomic resources such as UK Biobank are invaluable to researchers around the world. But the size and complexity of these datasets — which often measure in hundreds of thousands of samples and petabytes of data — present a barrier to many users, even if the data are openly accessible to bona fide researchers.
We have been proud to work closely with UK Biobank, Regeneron Genetics Center (RGC), Amazon Web Services, and other partners to provide tools, such as DNAnexus Apollo and the cloud-based UK Biobank Research Analysis Platform, that enable scientists to tap into these enormous banks of genomic and phenotypic data.
In collaboration with the RGC, we have now delved even deeper to help alleviate the challenges of harmonizing sequencing data across large projects without sacrificing flexibility. At such large scales, even small differences in read mapping can manifest in downstream analyses, and while reprocessing is inevitable in the long run, it is best (and cheapest) to get it right the first time.
So, we set out to make consistent and lossless read mapping as hands-off as possible.
The result? The OQFE App.
The Original Quality Functionally Equivalent (OQFE) App is a one-button implementation of the OQFE read mapping protocol, which was used for the recent UK Biobank release of 200,000 whole-exome sequencing (WES) samples. Users only need to provide a CRAM or FASTQ file containing Illumina short reads, and the OQFE App will generate a CRAM meeting all OQFE specifications, including:
- Deterministic alt-aware mapping to the GRCh38 reference genome
- Duplicate-marked primary and supplementary alignments
- Lossless read and base-quality compression (i.e., fully recoverable FASTQs)
- Forward compatibility with quality-score recalibration and binning schemes
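One way to read "fully recoverable FASTQs" is as a round-trip property: the reads extracted from the CRAM should match the original reads exactly in name, sequence, and base qualities, regardless of order. The Python sketch below illustrates that check under simple assumptions (plain four-line FASTQ records, read names compared up to the first whitespace). The function names are hypothetical and are not part of the OQFE App.

```python
import io
from collections import Counter

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line) ends the stream
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qualities = handle.readline().strip()
        name = header[1:].split()[0]  # drop '@' and any trailing comment
        yield name, sequence, qualities

def same_reads(original, recovered):
    """True if both streams hold the same multiset of reads, ignoring order."""
    return Counter(read_fastq(original)) == Counter(read_fastq(recovered))

# Toy data: the recovered file is name-sorted differently but otherwise identical.
original = io.StringIO("@r2\nTTGA\n+\nF,FF\n@r1\nACGT\n+\nFFFF\n")
recovered = io.StringIO("@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nF,FF\n")
print(same_reads(original, recovered))  # True
```

A single changed base-quality character would make the comparison fail, which is the sense in which binned or recalibrated files are not lossless.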
Unlike most bioinformatics tools, the OQFE App doesn’t come with loads of options and tunable parameters. Because the goal is data harmonization, the OQFE App hard-codes all software versions, reference files, and runtime parameters. (Of course, the OQFE App is open-source and built from open-source software, so feel free to take it apart and see what makes it tick!) Here’s a bit of history on how we landed on this protocol.
The OQFE protocol is an extension of the functionally equivalent (FE) pipeline created in 2017, which used reference-based compression (CRAM) to harmonize more than 400,000 whole-genome samples’ worth of data across multiple large-scale NIH sequencing projects, with a three-fold reduction in size. The FE pipeline specifies an exact version of the GRCh38 reference genome, parameters to ensure deterministic alt-aware mapping, a duplicate-marking strategy that ensures supplementary alignments are appropriately marked, and a set of consistent CRAM tags.
The OQFE protocol retains all these features, with updated versions of the constituent programs to address some minor bugs (see the OQFE paper for details). Unlike the FE protocol, however, the OQFE protocol does not implement any base-quality recalibration or quality-score binning: all original quality scores are retained. This advance is made possible by the general shift of short-read sequencing to the NovaSeq platform, which natively produces four-valued quality scores (i.e., they are already well binned). OQFE CRAMs are thus both backward compatible (you can remake the original FASTQs) and forward compatible (you can directly call variants, or run quality-score recalibration and apply your own binning strategy).
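To make "already well binned" concrete, here is a small Python sketch that tallies the distinct Phred scores in FASTQ quality strings and applies a toy binning step of the kind OQFE skips. It assumes the standard Phred+33 encoding. The bin boundaries and representative values are hypothetical, chosen only for illustration; they are not taken from the FE or OQFE specifications.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding

def phred_scores(quality_string):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]

def distinct_scores(quality_strings):
    """Count how often each Phred score occurs across many reads."""
    counts = Counter()
    for qs in quality_strings:
        counts.update(phred_scores(qs))
    return counts

# Hypothetical bins: (inclusive upper bound, representative score).
ILLUSTRATIVE_BINS = [(9, 2), (19, 12), (29, 23), (93, 37)]

def bin_score(score, bins=ILLUSTRATIVE_BINS):
    """Map a score to the representative value of the first bin that contains it."""
    for upper, representative in bins:
        if score <= upper:
            return representative
    raise ValueError(f"score {score} is outside the illustrative bins")

def bin_quality_string(quality_string):
    """Apply the toy binning to a whole quality string (a lossy operation)."""
    return "".join(chr(bin_score(s) + PHRED_OFFSET) for s in phred_scores(quality_string))

reads = ["FFFF-F#F", "FF8FFFFF"]             # toy four-valued quality strings
print(sorted(distinct_scores(reads)))         # [2, 12, 23, 37]
print(bin_quality_string("FFFF-F#F"))         # FFFF-F#F (unchanged by these bins)
print(bin_quality_string("I5"))               # F8 (original scores 40 and 20 are lost)
```

Quality strings that already take only a few values pass through such a step unchanged, while finer-grained scores cannot be recovered afterwards, which is why retaining the originals keeps both directions open.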
All UK Biobank exome CRAMs are processed by the OQFE protocol, and the OQFE App provides multiple features that make it easy for researchers to harmonize their own data with the UKB exomes.
- Start with a FASTQ: If you have raw reads, the OQFE App will create all the required CRAM tags; you just provide a sample name.
- Start with a CRAM: If your reads are already mapped or reference-compressed, the OQFE App will name-sort your CRAM, roll back to a FASTQ, and OQFE-map your reads. This option also lets you keep your existing RG string, so it is a great way to add custom information to your OQFE CRAMs.
- Pick your pixels: Specify how optical duplicates are marked. Different flowcell geometries call for different optical-duplicate pixel distances. But don’t worry, pixel distance doesn’t affect the CRAMs (the same reads get marked as duplicates no matter what pixel distance you select), just how the duplicate reads are categorized in the duplicate summary output.
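Why the pixel distance changes the summary but not the set of marked reads can be sketched in a few lines of Python. The sketch assumes Illumina-style read names whose last four colon-separated fields are lane, tile, x, and y, and a simplified rule in which a duplicate counts as optical when it shares a lane and tile with the retained read and lies within the pixel distance on both axes. The function names, read names, and distances are hypothetical; this is not the OQFE App's own code.

```python
from typing import NamedTuple

class ReadPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int

def parse_position(read_name):
    """Extract lane, tile, x, y from a name like 'inst:run:flowcell:lane:tile:x:y'."""
    lane, tile, x, y = (int(field) for field in read_name.split(":")[-4:])
    return ReadPosition(lane, tile, x, y)

def is_optical(a, b, pixel_distance):
    """Same lane and tile, and within pixel_distance on both the x and y axes."""
    return (a.lane == b.lane and a.tile == b.tile
            and abs(a.x - b.x) <= pixel_distance
            and abs(a.y - b.y) <= pixel_distance)

def summarize_duplicates(retained, duplicates, pixel_distance):
    """Categorize the already-marked duplicates of one retained read."""
    origin = parse_position(retained)
    optical = sum(is_optical(origin, parse_position(d), pixel_distance) for d in duplicates)
    return {"duplicates": len(duplicates),
            "optical": optical,
            "non_optical": len(duplicates) - optical}

retained = "INST:1:FLOWCELL:1:1101:1000:2000"
duplicates = [
    "INST:1:FLOWCELL:1:1101:1050:2030",  # same tile, very close
    "INST:1:FLOWCELL:1:1101:3000:2000",  # same tile, farther away
    "INST:1:FLOWCELL:2:1101:1000:2000",  # different lane
]
print(summarize_duplicates(retained, duplicates, 100))
# {'duplicates': 3, 'optical': 1, 'non_optical': 2}
print(summarize_duplicates(retained, duplicates, 2500))
# {'duplicates': 3, 'optical': 2, 'non_optical': 1}
```

In both calls the same three reads are duplicates; only the split between the optical and non-optical categories moves with the chosen distance.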
The OQFE App is especially easy for our users to execute on the DNAnexus Platform. A DNAnexus user simply needs to add input files and hit run to execute the exact methods autonomously, without the need for a specialized environment. And since UK Biobank files will be available on the UK Biobank Research Analysis Platform, there will be no need to download data or transfer files to local environments.
But harmonization only works if everyone can do it, so you don’t need to be on DNAnexus to run OQFE. The OQFE App is open-source software, free for anyone to download as a single containerized pipeline that includes all required source and validation files and can be executed on any local or cloud infrastructure that supports Docker.