Genomic datasets from current-day sequencing projects involve thousands of samples and routinely reach or exceed the petabyte mark. Researchers looking to use these data need analysis pipelines and compute infrastructure optimized for processing and querying heterogeneous data quickly and consistently. Recently, experts from Biogen, Databricks, and DNAnexus discussed a collaboration focused on infrastructure for analyzing large genomics datasets from the UK Biobank. The discussion was part of a session at the Spark + AI Summit 2020, held virtually in June.
The UK Biobank (UKB) has been collecting data from participants, ages 40 to 69, at 22 centers across the United Kingdom. The data supports a 30-year, large-scale population study of the effects of genetic predisposition and environmental exposure on disease development. About 500,000 volunteers were recruited to contribute phenotype and genome information for the study. In 2018, the Regeneron Genetics Center, in collaboration with eight pharma companies, formed the UK Biobank Exome Sequencing Consortium (UK-ESC) to sequence the exomes of all UK Biobank participants. Currently, over 90% of the project has been completed.
Consortium partners like Biogen have early access to the data. The company is exploring the protein-coding regions of the genome to better understand the biological basis of neurological disease. It is using the UKB information to rank candidate compounds in its drug portfolio as well as to identify novel gene targets. In his portion of the presentation, David Sexton, Senior Director, Genome Technology and Informatics at Biogen, highlighted some of his company’s early experiences working with such a large dataset.
“The UK Biobank data will be approximately one petabyte of data, and we did not have that storage currently in our data center,” he explained. Biogen also needed more bandwidth to move the petabyte-size dataset into its data center for processing and analysis. Recognizing these needs, Biogen turned to DNAnexus and Databricks for help in implementing cloud infrastructure and optimized pipelines that would support high quality, consistent genomic analysis and scale as needed.
In August, the UK Biobank announced it had enlisted the services of DNAnexus to help develop its own cloud-based Data Analysis Platform. DNAnexus has previously worked with the UK Biobank and the Regeneron Genetics Center to deliver exome sequencing results back to the UK-ESC members. The DNAnexus Titan and Apollo Platforms make it possible to process and analyze large datasets quickly and efficiently without compromising data integrity or security. In his part of the presentation, John Ellithorpe, DNAnexus’ President & Chief Product Officer, highlighted the complexities of analyzing a 500,000-participant dataset with heterogeneous data types.
Analysis requires tools to handle the raw sequence reads and identify variants, as well as to combine the sequence data with patient health and assessment information in a way that lets researchers query and search it. With 500,000 samples, the sequencing step alone generates about two million files – including alignment, VCF, and index files. That amounts to roughly 1.5 petabytes of data that needs to be processed and cleaned. Each participant’s exome could contain millions of variants, resulting in trillions of data points in the combined dataset.
To give a sense of how long it can take to process genomic data, Regeneron used a cost-optimized version of DNAnexus’ Titan product to analyze the UKB datasets. That version of the solution processed a single sample in about four hours; across the 500,000-sample dataset, that translates to roughly two million machine hours. “If you’re processing a few samples, it’s actually not that hard to do yourself in the cloud. But once you get into the levels of thousands, tens of thousands, hundreds of thousands of samples, you really want to do it consistently and efficiently, then the fault tolerance into the clouds are very important,” Ellithorpe said. “The ability to focus only on exceptions and the science and not having to deal with cloud optimizations and such is also quite important.” The DNAnexus Titan Platform blends the scalability of the cloud with a tailor-made experience designed for large-scale genome-driven analyses.
All that genomic data then needs to be combined with clinical information. This is usually pulled from electronic medical records, and includes demographic, health, and phenotype information from each patient. “There are a lot of different fields, over 3500 different phenotypic fields. And they’re also quite complex in that you might have significant encodings of the values, you might have hierarchies in terms of what the values could be,” Ellithorpe explained during his presentation. There’s also the longitudinal data collected from participants to consider.
In total, the clinical data from the UKB cohort filled over 11,000 columns. This information needed to be combined with the genomic data so that consortium members could query and explore it. This is where DNAnexus’ Apollo Platform comes in. Apollo uses an Apache Spark-based engine to support both large-scale rigorous analyses and ad hoc interactive queries. Rapid querying of the UKB data is possible only with both horizontal (phenotype fields) and vertical (genomic location) partitioning of the data. These underlying capabilities enable both computational scientists and clinical scientists to interact, analyze, and collaborate at scale. Apollo “takes the high quality datasets that are being processed out of Titan and combines it with structured data, that is the health assessment data, and provides you an ability to interact with it in a variety of ways,” Ellithorpe explained.
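The presentation does not detail Apollo’s internals, but the idea behind vertical (genomic location) partitioning can be illustrated with a minimal, hypothetical sketch: bucket variant records into fixed-size bins per contig so that a region query only scans the few bins it overlaps. The bin size, record layout, and function names here are illustrative assumptions, not Apollo’s actual scheme.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # hypothetical 1 Mb bins per contig

def partition_key(contig, position):
    """Map a variant to a (contig, bin) partition so a region query
    only has to scan the bins overlapping the requested interval."""
    return (contig, position // BIN_SIZE)

def build_index(variants):
    """Group variant records by their partition key."""
    index = defaultdict(list)
    for v in variants:
        index[partition_key(v["contig"], v["pos"])].append(v)
    return index

def query_region(index, contig, start, end):
    """Fetch variants in [start, end) without a full scan: visit only
    the partitions whose bins overlap the region, then filter."""
    hits = []
    for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
        for v in index.get((contig, b), []):
            if start <= v["pos"] < end:
                hits.append(v)
    return hits

variants = [
    {"contig": "chr1", "pos": 10_500, "ref": "A", "alt": "G"},
    {"contig": "chr1", "pos": 2_300_000, "ref": "C", "alt": "T"},
    {"contig": "chr2", "pos": 10_500, "ref": "G", "alt": "A"},
]
index = build_index(variants)
print(query_region(index, "chr1", 0, 1_000_000))  # only the chr1 variant at 10,500
```

In a real Spark deployment the same effect comes from partitioning the variant table by locus (and, horizontally, by phenotype field groups) so that executors read only the relevant slices.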
In his portion of the presentation, Frank Nothaft, technical director for Healthcare and Life Sciences at Databricks, described some of the functionality that his company’s platform provided for Biogen. Databricks’ platform provides optimized cloud compute machines and notebook functionality, as well as optimized software stacks for processing large-scale datasets. These tools support everything from initial data processing through statistical analysis of genomic variation. Databricks also provided Biogen with optimized versions of the Genome Analysis Toolkit’s best practices pipeline, the GATK joint genotyping pipeline, access to several open source libraries, and open-source tools for merging research datasets and running large-scale statistical analysis.
“We’ve been able to do things like reduce the latency of running the GATK germline pipeline on a high coverage whole genome from 30 hours to under 40 minutes by having about a 2x performance improvement from a CPU efficiency perspective, and then using the power of Spark to parallelize this work across many cores,” Nothaft explained.
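The parallelization Nothaft describes follows a common scatter-gather pattern: split the genome into region shards, run the per-region step on many workers at once, then merge the per-shard results in genomic order. The sketch below is an illustration of that pattern only, not the Databricks implementation; the shard size and the stubbed `call_variants` step are assumptions, and a real pipeline would hand shards to Spark executors (or processes) rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def shard_genome(contig_lengths, shard_size):
    """Scatter step: split each contig into fixed-size regions."""
    shards = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, shard_size):
            shards.append((contig, start, min(start + shard_size, length)))
    return shards

def call_variants(shard):
    """Stand-in for a per-region variant-calling step; a real pipeline
    would run the caller over just this region's reads."""
    contig, start, end = shard
    return [f"{contig}:{start}-{end}"]

def run_pipeline(contig_lengths, shard_size=25_000_000, workers=4):
    shards = shard_genome(contig_lengths, shard_size)
    # Threads are used here only to keep the sketch self-contained;
    # CPU-bound work belongs on processes or Spark executors.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(call_variants, shards)
    # Gather step: concatenate per-shard call sets in genomic order.
    return [call for calls in results for call in calls]

calls = run_pipeline({"chr1": 60_000_000})
print(len(calls))  # → 3 region shards processed in parallel
```

Because the shards are independent, wall-clock time drops roughly with the number of workers, which is how a 30-hour serial run can collapse to well under an hour once per-core efficiency is also improved.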
Using these tools, Biogen researchers were able to quickly ingest and sort through the exome sequencing data and identify millions of variants. They could also annotate variants and compute genome-wide associations, both between variants and diseases and between variants and their functional consequences in the genome. “They were able to take a pipeline that previously took two weeks to process 700,000 variants and accelerate it tremendously,” Nothaft said. “They were able to annotate two million variants in 15 minutes, so they had orders of magnitude acceleration there.”
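A genome-wide association scan boils down to scoring every variant for a difference in allele frequency between phenotype groups. As a toy illustration only (not Biogen’s pipeline, which uses far more sophisticated models and covariates), here is a per-variant allelic chi-square test over case/control allele counts; the variant IDs and counts are made up.

```python
def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square statistic (1 d.o.f.) for a 2x2 allele-count table
    comparing alt-allele frequency in cases vs. controls.
    Assumes no row or column sum is zero."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def scan(allele_counts):
    """Score every variant and rank by statistic, largest first.
    A real pipeline would do this for millions of variants in
    parallel and convert statistics to p-values."""
    scores = {vid: allelic_chi_square(*c) for vid, c in allele_counts.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

counts = {
    "rs0001": (120, 880, 80, 920),   # alt allele enriched in cases
    "rs0002": (100, 900, 100, 900),  # identical frequencies in both groups
}
print(scan(counts))  # rs0001 ranks first; rs0002 scores 0.0
```

Annotation is the complementary step: each variant ID is joined against functional databases, which is why fast merging of large tables matters as much as the statistics themselves.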
With their updated infrastructure, Biogen researchers have been studying how different variants function and the consequences of their activity in the genome. The company has already used the UKB dataset to identify new drug targets, reposition compounds in its drug portfolio, and to deepen its understanding of the underlying biology of neurodegenerative diseases. You may watch the complete presentation here: Improving Therapeutic Development at Biogen with UK Biobank Data, Databricks, and DNAnexus.