Improving Therapeutic Development at Biogen with UK Biobank Data, Databricks, and DNAnexus

UK Biobank to Inform Drug Development

Genomic datasets from current-day sequencing projects involve thousands of samples and routinely reach or exceed the petabyte mark. Researchers looking to use these data need analysis pipelines and compute infrastructure optimized for processing and querying heterogeneous data quickly and consistently. Recently, experts from Biogen, Databricks, and DNAnexus discussed a collaboration focused on infrastructure for analyzing large genomic datasets from the UK Biobank. The discussion was part of a session at the Spark + AI Summit 2020 virtual conference, held in June.

The UK Biobank (UKB) has been collecting data from participants, ages 40 to 69, at 22 centers across the United Kingdom. The data supports a 30-year, large-scale population study of the effects of genetic predisposition and environmental exposure on disease development. About 500,000 volunteers were recruited to contribute phenotype and genome information for the study. In 2018, the Regeneron Genetics Center, in collaboration with eight pharma companies, formed the UK Biobank Exome Sequencing Consortium (UK-ESC) to sequence the exomes of all UK Biobank participants. To date, over 90% of the sequencing has been completed.

Consortium partners like Biogen have early access to the data. The company is exploring the protein-coding regions of the genome to better understand the biological basis of neurological disease. It is using the UKB information to rank candidate compounds in its drug portfolio as well as to identify novel gene targets. In his portion of the presentation, David Sexton, Senior Director, Genome Technology and Informatics at Biogen, highlighted some of his company’s early experiences working with such a large dataset.

“The UK Biobank data will be approximately one petabyte of data, and we did not have that storage currently in our data center,” he explained. Biogen also needed more bandwidth to move the petabyte-size dataset into its data center for processing and analysis. Recognizing these needs, Biogen turned to DNAnexus and Databricks for help in implementing cloud infrastructure and optimized pipelines that would support high quality, consistent genomic analysis and scale as needed.

In August, the UK Biobank announced that it had enlisted the services of DNAnexus to help develop its own cloud-based Data Analysis Platform. DNAnexus had previously worked with the UK Biobank and the Regeneron Genetics Center to deliver exome sequencing results back to the UK-ESC members. The DNAnexus Titan and Apollo Platforms make it possible to process and analyze large datasets quickly and efficiently without compromising data integrity or security. In his part of the presentation, John Ellithorpe, DNAnexus’ President & Chief Product Officer, highlighted the complexities of analyzing a 500,000-participant dataset with heterogeneous data types.

Analysis requires tools to handle the raw sequence reads and identify variants, as well as to combine the sequence data with patient health and assessment information in a way that lets researchers query and search it. With 500,000 samples, the sequencing step alone generates about two million files – including alignment, VCF, and index files. That amounts to roughly 1.5 petabytes of data that need to be processed and cleaned. Each participant’s exome can contain millions of variants, resulting in trillions of data points in the combined dataset.

To give a sense of how long it can take to process genomic data, Regeneron used a cost-optimized version of DNAnexus’ Titan product to analyze data from UKB datasets. That version of the solution was able to process a single sample in about four hours. For the 500,000-sample dataset, this translates to millions of machine hours. “If you’re processing a few samples, it’s actually not that hard to do yourself in the cloud. But once you get into the levels of thousands, tens of thousands, hundreds of thousands of samples, where you really want to do it consistently and efficiently, then the fault tolerance in the cloud is very important,” Ellithorpe said. “The ability to focus only on exceptions and the science, and not having to deal with cloud optimizations and such, is also quite important.” The DNAnexus Titan Platform blends the scalability of the cloud with a tailor-made experience designed for large-scale genome-driven analyses.
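A quick back-of-the-envelope check shows how these scale figures hang together (a sketch; the four-files-per-sample breakdown is our assumption, not a figure from the presentation):

```python
# Rough scale estimates for the 500,000-sample UKB exome dataset.
samples = 500_000

# Assume ~4 files per sample (e.g., alignment + index, VCF + index);
# this breakdown is illustrative, not from the presentation.
files = samples * 4
print(f"{files:,} files")                  # 2,000,000 files

hours_per_sample = 4                       # cost-optimized Titan, per the talk
machine_hours = samples * hours_per_sample
print(f"{machine_hours:,} machine hours")  # 2,000,000 machine hours
```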

All that genomic data then needs to be combined with clinical information, usually pulled from electronic medical records, including demographic, health, and phenotype information for each patient. “There are a lot of different fields, over 3500 different phenotypic fields. And they’re also quite complex in that you might have significant encodings of the values, you might have hierarchies in terms of what the values could be,” Ellithorpe explained during his presentation. There’s also the longitudinal data collected from participants to consider.
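To make the encoding problem concrete, here is a minimal sketch of decoding one UKB-style categorical field; the codes, labels, and hierarchy are illustrative placeholders, not entries from the actual UKB data dictionary:

```python
# Hypothetical UKB-style data coding: raw values are integer codes that
# must be mapped to labels, and codes can sit in a category hierarchy.
CODING_ILLNESS = {
    1065: "hypertension",
    1074: "angina",
    1081: "stroke",
    -1: "do not know",            # negative codes often mark missingness
    -3: "prefer not to answer",
}

# Hypothetical hierarchy: child code -> parent category code.
PARENT = {1074: 1071, 1081: 1071}  # 1071 = "cardiovascular disease"

def decode(value: int) -> str:
    """Map a raw encoded value to a human-readable label."""
    return CODING_ILLNESS.get(value, f"unknown code {value}")

print(decode(1074))  # angina
print(decode(-3))    # prefer not to answer
```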

In total, the clinical data from the UKB cohort filled over 11,000 columns. This information needed to be combined with the genomic data so that consortium members could query and explore it. This is where DNAnexus’ Apollo Platform comes in. Apollo uses an Apache Spark-based engine to support both large-scale rigorous analyses and ad-hoc interactive queries. Rapid querying of the UKB data is only possible with both horizontal (phenotype fields) and vertical (genomic location) partitioning of the data. These underlying capabilities enable both computational scientists and clinical scientists to interact, analyze, and collaborate at scale. Apollo “takes the high quality datasets that are being processed out of Titan and combines it with structured data, that is the health assessment data, and provides you an ability to interact with it in a variety of ways,” Ellithorpe explained.
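The flavor of query this partitioning enables looks something like the PySpark sketch below; the paths, table layout, and column names are hypothetical stand-ins, not Apollo’s actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Genotypes stored partitioned by contig (vertical/genomic partitioning),
# so a region query only scans the relevant partitions.
genotypes = spark.read.parquet("/data/ukb/genotypes")

# Phenotype table with one row per participant (horizontal partitioning
# keeps the 11,000+ clinical columns manageable).
phenotypes = spark.read.parquet("/data/ukb/phenotypes")

# Pull carriers of variants in the BRCA1 region (GRCh38 coordinates)
# and join their genotypes to clinical fields.
carriers = (
    genotypes
    .filter("contig = 'chr17' AND position BETWEEN 43044295 AND 43125364")
    .join(phenotypes, "participant_id")
    .select("participant_id", "position", "ref", "alt", "call", "diagnosis")
)
carriers.show(10)
```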

In his portion of the presentation, Frank Nothaft, Technical Director for Healthcare and Life Sciences at Databricks, described some of the functionality that his company’s platform provided for Biogen. Databricks’ platform provides optimized cloud compute machines and notebook functionality, as well as optimized software stacks for processing large-scale datasets. These tools support everything from initial data processing through statistical analysis of genomic variation. Databricks also provided Biogen with optimized versions of the Genome Analysis Toolkit’s best-practices pipeline and the GATK joint genotyping pipeline, access to several open-source libraries, and open-source tools for merging research datasets and running large-scale statistical analyses.

“We’ve been able to do things like reduce the latency of running the GATK germline pipeline on a high coverage whole genome from 30 hours to under 40 minutes by having about a 2x performance improvement from a CPU efficiency perspective, and then using the power of Spark to parallelize this work across many cores,” Nothaft explained.
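The arithmetic behind that quote is worth unpacking (a rough decomposition; the implied core count is our inference, not a figure from the talk):

```python
# Decomposing the reported 30-hour -> ~40-minute germline pipeline speedup.
baseline_hours = 30
cpu_speedup = 2                                    # ~2x CPU-efficiency gain
single_core_hours = baseline_hours / cpu_speedup   # ~15 core-hours of work

target_minutes = 40
# Cores needed to finish ~15 core-hours in ~40 minutes via Spark:
cores = single_core_hours * 60 / target_minutes
print(round(cores))  # ~23 cores running in parallel
```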

Using these tools, Biogen researchers were able to quickly ingest and sort through the exome sequencing data and identify millions of variants. They could also annotate variants and compute genome-wide associations, both between variants and diseases and between variants and their functional consequences on the genome. “They were able to take a pipeline that previously took two weeks to process 700,000 variants and accelerate it tremendously,” Nothaft said. “They were able to annotate two million variants in 15 minutes, so they had orders of magnitude acceleration there.”
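A minimal sketch of what this style of Spark-native analysis can look like, assuming the open-source Glow library (one open-source tool in this space; we are not claiming it is exactly the stack Biogen ran), with illustrative paths and column names:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
import glow

spark = SparkSession.builder.getOrCreate()
spark = glow.register(spark)  # register Glow's variant SQL functions

# Ingest exome variant calls directly from VCF into a DataFrame.
variants = spark.read.format("vcf").load("/data/ukb_exomes.vcf.gz")

# Convert genotype calls to 0/1/2 allele dosages per sample.
dosages = variants.withColumn("gt", F.expr("genotype_states(genotypes)"))

# Per-variant linear-regression GWAS; the `covariates` and `phenotype`
# columns are assumed to have been joined in from the clinical dataset.
results = dosages.selectExpr(
    "contigName", "start",
    "expand_struct(linear_regression_gwas(gt, covariates, phenotype))",
)
```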

With their updated infrastructure, Biogen researchers have been studying how different variants function and the consequences of their activity in the genome. The company has already used the UKB dataset to identify new drug targets, reposition compounds in its drug portfolio, and deepen its understanding of the underlying biology of neurodegenerative diseases. You may watch the complete presentation here: Improving Therapeutic Development at Biogen with UK Biobank Data, Databricks, and DNAnexus.

Diving Deep into the Data Ocean

At City of Hope, POSEIDON was developed to manage the center’s vast data ocean: an informatics platform fed by so many streams of data that it has created a deep, diverse ecosystem supporting the dynamic precision medicine program at one of the country’s top comprehensive cancer centers.

City of Hope Informatics Platform

In a recent fireside chat with our Chief Revenue Officer Mark Swendsen, City of Hope Senior Vice President and Chief Informatics Officer Sorena Nadaf dove in with details of the massive informatics challenge, and described why the California cancer center selected DNAnexus as its partner for the project. 

A wide pool of patient information   

From real-world data to real-world evidence to real-world action, City of Hope’s vision for its precision oncology program is one of full integration between research and clinical care, pooling a wide variety of data from multiple sources, to be accessed by bioinformaticians and physicians alike. 

Tens of thousands of patients pass through the doors of City of Hope’s main Duarte campus outside Los Angeles and its 31 clinical network locations across Southern California each year, and each cancer patient’s tumor is genetically sequenced to help guide therapeutic options. In addition, the new system will integrate other data from across the patient journey, from disease registries to pathologies, molecular characterization of the tumor, medical record data, and clinical trials information.

Among the requirements of the new platform:

SCALE: “It needed to scale because we were bursting at the seams across our own internal network.” 

COLLABORATION: “Building a network was also critical for us. We needed a way to not stay siloed in our ideas. In academic matrix institutions, we collaborate deeply, across disease programs.”

EASE OF USE: “We needed to make it useful for bioinformaticians who are very technical, but we also needed it easy for oncologists who just wanted a dashboard to look at analytics really quickly, or to be able to browse across various data sets that are specific to their disease.” 

SECURITY: “Data governance and security is a big part of this. It needs to be well managed across a matrix organization.”

Why DNAnexus?

Nadaf and his colleagues spent several months scoping out their precision oncology platform project and quickly realized they couldn’t do it alone. 

“You can’t build it yourself anymore,” Nadaf said. “The time has come and gone where you can throw everything into an Excel spreadsheet or a REDCap database, or buy an infrastructure that’s already built and try to squeeze everything into it.” 

“We really needed an environment that was fine-tuned and revolved around our workflow. And we needed the right experts to help us strategize. Once we ingest all this data, how do we provide this as a solution, primarily for City of Hope, but also our growing network across the landscape of precision medicine institutions?”

Nadaf said he was familiar with DNAnexus thanks to previous work together via the Cancer Informatics for Cancer Centers (Ci4CC).

“It’s been a really good symbiotic relationship for us,” Nadaf said. “I think this partnership is extremely unique and I really do believe that we are onto something very, very special.”

Enabling super science and patient care

The platform is not just a static repository of data, Nadaf said. It’s been an “enabler.” It’s helped inform City of Hope’s unique in-house drug development. It’s helped place patients into clinical trials. It’s been a huge help for tumor boards, where clinicians, researchers, and technical curators come together to make decisions on tricky cases. And it has led to new research ideas, new methods, and new translational projects.

It has already proved its value in time and energy saved.

“I think it’s helped us really catapult our initiative a few years, to say the least,” Nadaf said. 

It’s also changed expectations, and opened up new possibilities. 

“It’s no longer a question of can we do this? It’s a matter of, this is exactly what we need, let’s find how to expand our platform to support it.”

Learn more about the partnership here.

DNAnexus R&D Report: Benchmarking Germline Variant Calling Pipelines

Germline Variant Calling

Variant calling is a staple of genome analysis. Developments in computational methods and algorithms have produced a plethora of programs and pipelines for calling germline variants in both whole-genome and whole-exome sequence data. With so many options to choose from, picking a pipeline for an analysis project can be challenging.

DNAnexus researchers are helping to narrow that search by benchmarking the performance of popular variant calling pipelines, determining the factors that can affect performance, and matching appropriate pipelines to specific scientific needs. They describe the results of comparing six of these well-known pipelines in a recent report. The report includes detailed data on algorithm versions and compute instance types, as well as runtimes and the accuracy of the resulting variant calls. The team used variant truth sets from the Genome in a Bottle (GIAB) Consortium as the reference for the analysis. This dataset offers high-confidence SNPs, INDELs, and homozygous reference regions generated using data from seven human genomes.

To test the pipelines, the research team used whole-exome and whole-genome sequence reads from NIST. Specifically, they downloaded five whole-exome and seven whole-genome samples. For the variant truth set, the researchers used high-confidence regions from version 3.3.2 of the NIST dataset. For the pipelines, the team used four off-the-shelf applications from DNAnexus’ tools library, which offers containerized software tools and their dependencies for easy implementation in compute environments, and packaged two additional apps for the comparison tests. The researchers processed each sample from raw reads through to variant calls using default pipeline parameters and settings. They used the standards set by the Global Alliance for Genomics and Health (GA4GH) to assess the accuracy of small germline variant calls. Each of the six pipelines was run on a single AWS instance.
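The GA4GH small-variant benchmarking framework boils comparison against a truth set down to a handful of counts and the metrics derived from them. A minimal sketch of those metrics (the counts below are made-up placeholders, not numbers from the report):

```python
# GA4GH-style accuracy metrics from truth/query comparison counts.
tp = 3_950_000  # truth variants matched by the pipeline (placeholder)
fn = 12_000     # truth variants the pipeline missed (placeholder)
fp = 9_500      # pipeline calls absent from the truth set (placeholder)

recall = tp / (tp + fn)     # fraction of true variants recovered
precision = tp / (tp + fp)  # fraction of calls that are correct
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.4f} precision={precision:.4f} F1={f1:.4f}")
```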

| Label | Software | Version | AWS Instance Type+ | Variant Calling Algorithm | DNAnexus App |
|---|---|---|---|---|---|
| gatk4 | BWA-MEM + GATK | 0.7.17-r1188 + gatk-4.1.4.1 | c5d.18xlarge | GATK HaplotypeCaller | Upon request |
| parabricks_deepvariant | Parabricks Pipelines DeepVariant | v3.0.0_2 | g4dn.12xlarge | DeepVariant | pbdeepvariant |
| parabricks_germline | Parabricks Pipelines Germline | v3.0.0_1 | g4dn.12xlarge | GATK HaplotypeCaller | pbgermline |
| sentieon_dnascope | Sentieon (DNAscope) | sentieon_release_201911 | c5d.18xlarge | DNAscope | sentieon_fastq_to_vcf |
| sentieon_haplotyper | Sentieon (Haplotyper) | sentieon_release_201911 | c5d.18xlarge | GATK HaplotypeCaller | sentieon_fastq_to_vcf |
| strelka2 | Strelka2 | 2.9.10 | c5d.18xlarge | Strelka2 | strelka2_germline |
Table 1: Germline variant calling software. We intentionally selected instance types with similar AWS hourly rates.

In general, the pipelines performed comparably. They called SNPs and INDELs in both the whole-exome and whole-genome samples with over 99% recall and precision. There is some variation between pipelines, likely due to limitations of the sample data collection methods as well as the reference build used. The DNAnexus researchers called variants against both the GRCh38 and hs37d5 builds. The GRCh38 build is more complete but contains many repetitive sequences in some genomic regions; these pose problems for pipelines and result in more false-negative and false-positive calls. Lastly, as noted above, this test used default pipeline parameters. In practice, researchers tweak these depending on the genomic regions or loci they are studying, and those changes influence pipelines’ precision and recall rates.

The runtimes were also comparable, with most pipelines performing notably faster than GATK4. In practice, these runtimes will vary depending on the type of hardware, the parameters, and the efficiency of the algorithms used. The rates reported here will also likely change as developers update their software and reference datasets continue to evolve and improve. GIAB has already updated at least one of the truth datasets used in this study. DNAnexus researchers plan to publish a follow-up report that will include updated information and analysis.

As with many things, the best pipeline for an analysis project boils down to research needs and available resources. The pipelines used in this report represent the current state-of-the-art and provide a useful starting point for making decisions about variant calling infrastructure. If you have a particular use case you would like to discuss, please reach out. We would be happy to talk with you.