How to Get Reliable Variant Calling in Repeat-Rich Regions

Diploid Assembly Approach

Current sequencing technologies and computational algorithms support the construction of phased diploid genome assemblies, which are useful for studying genomic regions with high variability. In repeat-rich regions, mapping-based methods with short reads usually fail to produce reliable variant calls, and obtaining a phased variant callset is harder still.

A recent study, published in Nature Communications, describes the development of the first variant benchmark built from a diploid assembly, overcoming the limitations of mapping-only approaches, using the HG002 dataset from the Genome in a Bottle (GIAB) consortium. Below is a summary of the findings from this collaboration between DNAnexus and GIAB, which builds on work from an NCBI pangenomics hackathon hosted by UCSC in March 2019 and on GIAB’s ongoing efforts to develop truth datasets that support genomic research.

The research team focused on benchmarking variants in one of the most polymorphic and medically important parts of the genome – the major histocompatibility complex (MHC). The MHC plays a crucial role in adaptive and innate immunity, among other functions. The high level of variability between individual genomes makes characterizing this region particularly challenging. Previous benchmarks that relied primarily on short-read mapping methods were unable to cover large portions of the region’s sequences because of large differences between the reads and the reference. This is particularly true for highly repetitive parts of the genome, such as segmental duplications and tandem repeats.

Recent improvements in de novo assembly using more accurate, consensus long reads have made it possible to “represent both haplotypes without suffering from small indel errors due to error-prone long reads,” the researchers note in the paper. Previously published assemblies still had “substantial error rates for small variants of at least 10% due to their reliance on error-prone long reads, and the individual long and ultralong read assembly incompletely resolved haplotypes.” Linked-read assembly methods have also been helpful for resolving the MHC, but the results are “fragmented,” with similar error rates for small variants.

To generate this benchmark, the researchers developed a local de novo assembly approach that uses long-read whole-genome sequencing data. In this method, the data is partitioned into two haplotypes using ultralong nanopore sequencing reads and barcoded short reads from long DNA molecules. Specifically, the team used WhatsHap to combine long-range phasing information from these reads, and then separated the circular consensus long reads into haplotypes for the diploid assembly. 
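
To give a sense of the shape of this read-partitioning step, the sketch below splits haplotagged long reads into per-haplotype files using pysam. It assumes the reads have already been assigned a haplotype tag (for example, the HP tag written by `whatshap haplotag`); the file names are hypothetical, and this is a minimal illustration rather than the authors’ exact pipeline.

```python
# Minimal sketch: split haplotagged CCS reads into per-haplotype FASTQ files.
# Assumes an HP tag of 1 or 2 has been written (e.g., by `whatshap haplotag`);
# the file names below are hypothetical.
import pysam

def split_by_haplotype(tagged_bam, out_prefix):
    """Write reads tagged HP=1 and HP=2 to separate FASTQ files."""
    outputs = {
        1: open(f"{out_prefix}.hap1.fastq", "w"),
        2: open(f"{out_prefix}.hap2.fastq", "w"),
    }
    with pysam.AlignmentFile(tagged_bam, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if not read.has_tag("HP"):
                continue  # untagged (unphased) reads are excluded from both bins
            quals = pysam.qualities_to_qualitystring(read.query_qualities)
            outputs[read.get_tag("HP")].write(
                f"@{read.query_name}\n{read.query_sequence}\n+\n{quals}\n"
            )
    for handle in outputs.values():
        handle.close()

split_by_haplotype("HG002.MHC.ccs.haplotagged.bam", "HG002.MHC")
```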

Fig. 1: Assembling a single contig for each haplotype

The next step was to “assemble both haplotypes of the MHC and use dipcall to generate benchmark variants and regions” in the GIAB sample, the researchers wrote. Specifically, they used dipcall to call variants by aligning the main haplotigs to the GRCh37 reference, then defined benchmark regions that excluded structural variants, extremely divergent regions, low-quality regions, and long homopolymers.
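
As a rough illustration of what a dipcall run looks like, the sketch below drives the standard run-dipcall workflow from Python. The reference file name, output prefix, and haplotig paths are placeholders, and the exclusion of structural variants, divergent regions, and long homopolymers happens in separate downstream curation steps not shown here.

```python
# Minimal sketch of a dipcall run: align two haplotigs to a reference and
# produce a diploid VCF plus callable regions (<prefix>.dip.vcf.gz, <prefix>.dip.bed).
# Paths and the prefix are placeholders, not the study's actual file names.
import subprocess

prefix = "HG002.MHC"
reference = "GRCh37.fa"                    # assumed GRCh37 FASTA
hap1, hap2 = "MHC.hap1.fa", "MHC.hap2.fa"  # placeholder haplotig FASTAs

# run-dipcall writes a Makefile that encodes the alignment and calling steps
with open(f"{prefix}.mak", "w") as makefile:
    subprocess.run(["run-dipcall", prefix, reference, hap1, hap2],
                   stdout=makefile, check=True)

# Executing the Makefile runs the minimap2 alignments and generates the calls
subprocess.run(["make", "-j2", "-f", f"{prefix}.mak"], check=True)
```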

The final benchmark set consists of phased small and structural variants. The small variant benchmark regions cover 94 percent of the MHC and contain 49 percent – over 7,300 – more variants than was possible with previous mapping-based benchmarks. The benchmark regions include 22,368 SNVs and indels smaller than 50 bp, and cover almost all 23 HLA genes, with some exceptions due to the high degree of variability in specific portions of those genes.

To check the accuracy of the benchmark set, the team compared their assembly-based diploid variant calls to 11 variant call sets generated using various methods and sequencing technologies. The results of the comparison revealed the benchmark reliably identified false positives and false negatives across methods and technologies. In addition, they found “high concordance” between the results of their assembly-based benchmark and existing mapping-based benchmarks within regions accessible to mappers. The additional variants that the assembly-based method found “are likely from those regions where HG002 has at least one haplotype that is highly diverged from the reference, making it challenging to map individual reads.”
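
For teams that want to evaluate their own callsets against the new benchmark, a comparison along these lines is typical. The sketch below uses hap.py, a widely used GA4GH benchmarking tool; the paper’s exact comparison tooling may differ, and all file names are placeholders.

```python
# Minimal sketch: compare a query callset against the MHC benchmark with hap.py,
# restricted to the benchmark regions. File names are placeholders.
import subprocess

subprocess.run([
    "hap.py",
    "HG002_MHC_benchmark.vcf.gz",     # truth: assembly-based benchmark variants
    "my_query_calls.vcf.gz",          # query: callset being evaluated
    "-f", "HG002_MHC_benchmark.bed",  # confine comparison to benchmark regions
    "-r", "GRCh37.fa",                # reference matching both VCFs
    "-o", "mhc_query_vs_benchmark",
], check=True)
# hap.py writes summary tables with TP/FP/FN counts, from which precision
# and recall for SNVs and indels can be read directly.
```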

“This benchmark was critical for the precisionFDA Truth Challenge V2 we held this summer, which highlighted performance of variant callers in the MHC,” said Justin Zook, Human Genetics Team Lead at NIST.

These are promising results for efforts to understand and map other highly variable parts of the genome. But there is still plenty of room for improvement. “Since most of the MHC alternate loci in GRCh38 and other MHC sequences are not fully continuous assemblies, our assembled haplotigs represent two of only a few continuously assembled MHC haplotypes,” the researchers wrote.  “We expect this curated benchmark set from a targeted diploid assembly will help the community improve variant calling methods and whole genome de novo assembly methods and form a basis for future diploid assembly-based benchmarks.”

Furthermore, with initiatives such as the Human Pangenome Reference Consortium, which aims to sequence 350 human genomes using long-read technology, “our knowledge about the whole MHC region will increase rapidly,” the researchers wrote. “A pan-genomics variant call benchmark for many individuals may become essential for economically genotyping the whole MHC region correctly.”

Meanwhile, other highly variable regions would also benefit from a benchmark set, including the KIR and IGH loci, as well as segmental duplications. Moving forward, “a combination of long read and short read technologies for resolving difficult genomic regions in many individuals will become important,” according to the team. “We hope the rich collections of diverse data sets and analyses for the GIAB samples and the future population-scale de novo sequencing will enable precision medicine from complicated genomic regions like MHC.” You can read the full paper here.

Improving Therapeutic Development at Biogen with UK Biobank Data, Databricks, and DNAnexus

UK Biobank to Inform Drug Development

Genomic datasets from present-day sequencing projects involve thousands of samples and routinely reach and exceed the petabyte mark. Researchers looking to use these data need analysis pipelines and compute infrastructure that are optimized for processing and querying heterogeneous data quickly and consistently. Recently, experts from Biogen, Databricks, and DNAnexus discussed a collaboration focused on infrastructure for analyzing large genomics datasets from the UK Biobank. The discussion was part of a session at the Spark + AI Summit 2020, held virtually in June.

The UK Biobank (UKB) has been collecting data from participants, aged 40 to 69, at 22 centers across the United Kingdom. The data support a 30-year, large-scale population study of the effects of genetic predisposition and environmental exposure on disease development. About 500,000 volunteers were recruited to contribute phenotype and genome information for the study. In 2018, the Regeneron Genetics Center, in collaboration with eight pharma companies, formed the UK Biobank Exome Sequencing Consortium (UK-ESC) to sequence the exomes of all UK Biobank participants. Currently, over 90% of the project is complete.

Consortium partners like Biogen have early access to the data. The company is exploring the protein-coding regions of the genome to better understand the biological basis of neurological disease. It is using the UKB information to rank candidate compounds in its drug portfolio as well as to identify novel gene targets. In his portion of the presentation, David Sexton, Senior Director, Genome Technology and Informatics at Biogen, highlighted some of his company’s early experiences working with such a large dataset.

“The UK Biobank data will be approximately one petabyte of data, and we did not have that storage currently in our data center,” he explained. Biogen also needed more bandwidth to move the petabyte-size dataset into its data center for processing and analysis. Recognizing these needs, Biogen turned to DNAnexus and Databricks for help in implementing cloud infrastructure and optimized pipelines that would support high quality, consistent genomic analysis and scale as needed.

In August, the UK Biobank announced that it had enlisted the services of DNAnexus to help develop its own cloud-based Data Analysis Platform. DNAnexus has previously worked with the UK Biobank and the Regeneron Genetics Center to deliver exome sequencing results back to the UK-ESC members. The DNAnexus Titan and Apollo Platforms make it possible to process and analyze large datasets quickly and efficiently without compromising data integrity or security. In his part of the presentation, John Ellithorpe, DNAnexus’ President & Chief Product Officer, highlighted the complexities of analyzing a 500,000-participant dataset with heterogeneous data types.

Analysis requires tools to handle the raw sequence reads and identify variants, as well as to combine the sequence data with participant health and assessment information in a way that lets researchers query and search it. With 500,000 samples, the sequencing step alone generates about two million files – including alignment, VCF, and index files. That amounts to roughly 1.5 petabytes of data that needs to be processed and cleaned. Each participant’s exome could contain millions of variants, resulting in trillions of data points in the combined dataset.

To give a sense of how long it can take to process genomic data, Regeneron used a cost-optimized version of DNAnexus’ Titan product to analyze data from UKB datasets. That version of the solution was able to process a single sample in about four hours. For the 500,000-sample dataset, this translates to millions of machine hours. “If you’re processing a few samples, it’s actually not that hard to do yourself in the cloud. But once you get into the levels of thousands, tens of thousands, hundreds of thousands of samples, you really want to do it consistently and efficiently, then the fault tolerance into the clouds are very important,” Ellithorpe said. “The ability to focus only on exceptions and the science and not having to deal with cloud optimizations and such is also quite important.” The DNAnexus Titan Platform blends the scalability of the cloud with a tailor-made experience designed for large-scale genome-driven analyses.
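
The scale figures quoted above hang together with some simple arithmetic; the per-sample file count below is an assumption used only to show how the totals arise.

```python
# Back-of-envelope check of the scale figures quoted above.
samples = 500_000
hours_per_sample = 4                       # single-sample Titan processing time cited above
print(f"{samples * hours_per_sample:,} machine hours")  # 2,000,000 machine hours

files_per_sample = 4                       # assumed mix: alignment + VCF, each with an index
print(f"{samples * files_per_sample:,} files")          # ~2,000,000 files, ~1.5 PB in total
```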

All that genomic data then needs to be combined with clinical information. This is usually pulled from electronic medical records, and includes demographic, health, and phenotype information from each patient. “There are a lot of different fields, over 3500 different phenotypic fields. And they’re also quite complex in that you might have significant encodings of the values, you might have hierarchies in terms of what the values could be,” Ellithorpe explained during his presentation.  There’s also the longitudinal data collected from participants to consider.

In total, the clinical data from the UKB cohort filled over 11,000 columns. This information needed to be combined with the genomic data so that consortium members could query and explore it. This is where DNAnexus’ Apollo Platform comes in. Apollo uses an Apache Spark-based engine to support both large-scale rigorous analyses and ad hoc interactive queries. Rapid querying of the UKB data depends on partitioning the data both horizontally (by phenotype field) and vertically (by genomic location). These underlying capabilities enable both computational scientists and clinical scientists to interact, analyze, and collaborate at scale. Apollo “takes the high quality datasets that are being processed out of Titan and combines it with structured data, that is the health assessment data, and provides you an ability to interact with it in a variety of ways,” Ellithorpe explained.
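
To make the partition-and-join idea concrete, here is a minimal PySpark sketch of the kind of query such a Spark-backed store supports: filter genotypes to a genomic region, then join them with phenotype columns. The table paths, column names, coordinates, and storage layout are all hypothetical; Apollo’s internal schema is not described in the presentation.

```python
# Minimal PySpark sketch: join region-filtered genotypes with phenotype columns.
# Paths, column names, and coordinates are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ukb-query-sketch").getOrCreate()

# Genotypes partitioned by genomic location; phenotypes stored column-wise.
genotypes = spark.read.parquet("s3://bucket/ukb/genotypes/")    # hypothetical path
phenotypes = spark.read.parquet("s3://bucket/ukb/phenotypes/")  # hypothetical path

result = (
    genotypes
    .filter((F.col("chrom") == "6") &
            F.col("pos").between(29_600_000, 33_100_000))       # illustrative region (e.g., the MHC)
    .join(phenotypes.select("participant_id", "age", "diagnosis_code"),
          on="participant_id")
    .groupBy("chrom", "pos", "ref", "alt", "diagnosis_code")
    .count()
)
result.show(10)
```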

In his portion of the presentation, Frank Nothaft, Technical Director for Healthcare and Life Sciences at Databricks, described some of the functionality that his company’s platform provided for Biogen. Databricks’ platform provides optimized cloud compute machines and notebook functionality, as well as optimized software stacks for processing large-scale datasets. These tools support everything from initial data processing through statistical analysis of genomic variation. Databricks also provided Biogen with optimized versions of the Genome Analysis Toolkit’s best practices pipeline, the GATK joint genotyping pipeline, access to several open-source libraries, and open-source tools for merging research datasets and running large-scale statistical analyses.

“We’ve been able to do things like reduce the latency of running the GATK germline pipeline on a high coverage whole genome from 30 hours to under 40 minutes by having about a 2x performance improvement from a CPU efficiency perspective, and then using the power of Spark to parallelize this work across many cores,” Nothaft explained.

Using these tools, Biogen researchers were able to quickly ingest and sort through the exome sequencing data and identify millions of variants. They could also annotate variants, compute genome-wide associations between variants and diseases, and assess the functional consequences of variants in the genome. “They were able to take a pipeline that previously took two weeks to process 700,000 variants and accelerate it tremendously,” Nothaft said. “They were able to annotate two million variants in 15 minutes, so they had orders of magnitude acceleration there.”
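
As a rough picture of how variant annotation scales on a Spark cluster, the sketch below joins a large variant table against a smaller annotation table. The paths, column names, and annotation source are hypothetical; this is not the actual Biogen pipeline.

```python
# Minimal sketch: annotate a large variant table by joining it against an
# annotation table in Spark. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("variant-annotation-sketch").getOrCreate()

variants = spark.read.parquet("s3://bucket/ukb/exome_variants/")        # hypothetical path
annotations = spark.read.parquet("s3://bucket/refs/annotation_table/")  # hypothetical path

# Broadcasting a modest annotation table avoids shuffling the much larger variant set.
annotated = variants.join(
    broadcast(annotations),
    on=["chrom", "pos", "ref", "alt"],
    how="left",
)
annotated.write.mode("overwrite").parquet("s3://bucket/ukb/annotated_variants/")
```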

With their updated infrastructure, Biogen researchers have been studying how different variants function and the consequences of their activity in the genome. The company has already used the UKB dataset to identify new drug targets, reposition compounds in its drug portfolio, and deepen its understanding of the underlying biology of neurodegenerative diseases. You may watch the complete presentation here: Improving Therapeutic Development at Biogen with UK Biobank Data, Databricks, and DNAnexus.

Diving Deep into the Data Ocean

At City of Hope, POSEIDON was developed to manage the center’s vast data ocean: an informatics platform fed by so many streams of data that it has created a deep, diverse ecosystem supporting the dynamic precision medicine program at one of the country’s top comprehensive cancer centers.

City of Hope Informatics Platform

In a recent fireside chat with our Chief Revenue Officer Mark Swendsen, City of Hope Senior Vice President and Chief Informatics Officer Sorena Nadaf dove in with details of the massive informatics challenge, and described why the California cancer center selected DNAnexus as its partner for the project. 

A wide pool of patient information   

From real-world data to real-world evidence to real-world action, City of Hope’s vision for its precision oncology program is one of full integration between research and clinical care, pooling a wide variety of data from multiple sources, to be accessed by bioinformaticians and physicians alike. 

Each year, tens of thousands of patients pass through the doors of the main City of Hope campus in Duarte, outside Los Angeles, and its 31 clinical network locations throughout Southern California, and each cancer patient has their tumor genetically sequenced to help guide therapeutic options. In addition, the new system will integrate other data associated with the patient journey, from disease registries to pathology, molecular characterization of the tumor, medical record data, and clinical trial information.

Among the requirements of the new platform:

SCALE: “It needed to scale because we were bursting at the seams across our own internal network.” 

COLLABORATION: “Building a network was also critical for us. We needed a way to not stay siloed in our ideas. In academic matrix institutions, we collaborate deeply, across disease programs.”

EASE OF USE: “We needed to make it useful for bioinformaticians who are very technical, but we also needed it easy for oncologists who just wanted a dashboard to look at analytics really quickly, or to be able to browse across various data sets that are specific to their disease.” 

SECURITY: “Data governance and security is a big part of this. It needs to be well managed across a matrix organization.”

Why DNAnexus?

Nadaf and his colleagues spent several months scoping out their precision oncology platform project and quickly realized they couldn’t do it alone. 

“You can’t build it yourself anymore,” Nadaf said. “The time has come and gone where you can throw everything into an Excel spreadsheet or a REDCap database, or buy an infrastructure that’s already built and try to squeeze everything into it.” 

“We really needed an environment that was fine-tuned and revolved around our workflow. And we needed the right experts to help us strategize. Once we ingest all this data, how do we provide this as a solution, primarily for City of Hope, but also our growing network across the landscape of precision medicine institutions?”

Nadaf said he was familiar with DNAnexus thanks to previous work together via the Cancer Informatics for Cancer Centers (Ci4CC).

“It’s been a really good symbiotic relationship for us,” Nadaf said. “I think this partnership is extremely unique and I really do believe that we are onto something very, very special.”

Enabling super science and patient care

The platform is not just a static repository of data, Nadaf said. It’s been an “enabler.” It’s helped inform City of Hope’s unique in-house drug development. It’s helped place patients into clinical trials. It’s been a huge help for tumor boards, where clinicians, researchers, and technical curators come together to make decisions on tricky cases. And it has led to new research ideas, new methods, and new translational projects.

It has already proved its value through savings in time and energy.

“I think it’s helped us really catapult our initiative a few years, to say the least,” Nadaf said. 

It’s also changed expectations, and opened up new possibilities. 

“It’s no longer a question of can we do this? It’s a matter of, this is exactly what we need, let’s find how to expand our platform to support it.”

Learn more about the partnership here.