DipAsm: A Method for Generating More Accurate Phased Assemblies

Researchers from academia and industry, including the Li Lab (Dana-Farber Cancer Institute), the Church Lab (Wyss Institute, Harvard University), DNAnexus Research Lab, and others, have developed a new genome assembly approach, dubbed DipAsm, for generating chromosome-scale phased contigs using long reads and long-range conformation data. The method, described in Nature Biotechnology, can generate results within a day and outperforms other approaches in terms of contiguity and completeness of the phased assemblies. As shown in the paper, when the method was applied to four public datasets, it produced haplotype-resolved assemblies with contig NG50 of up to 25 Mb and phased almost all heterozygous sites with 98-99 percent accuracy.

Being able to generate accurate chromosome-scale haplotype-resolved assemblies is crucial for capturing the heterozygous variation present in human genomes and understanding allele-specific methylation and gene expression in research and clinical applications. The accuracy of DipAsm’s assemblies makes it a valuable tool for exploring highly polymorphic parts of the genome such as the Human Leukocyte Antigen region and the Killer-cell Immunoglobulin-like Receptor region. “Our phased assemblies can reconstruct most of these regions with two contigs for each haplotype,” first author Shilpa Garg, a postdoctoral researcher in the Li Lab, explained in the paper. The authors also demonstrate how the method enables the identification of both SNPs and structural variants with greater sensitivity and specificity than some current methods.

Full details of how DipAsm works are provided in the paper. But briefly, it reconstructs the haplotypes present in diploid individuals using PacBio’s long High-Fidelity reads and Hi-C data as input. DipAsm works with data from an unphased Peregrine assembly scaffolded by 3D-DNA or HiRise. It uses DeepVariant to call small variants, phases them using WhatsHap and HapCUT2, partitions the reads, and then assembles each partition independently using the Peregrine assembly toolkit.
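The read-partitioning step can be illustrated with a small sketch. This is not DipAsm's code: it is a simplified majority-vote scheme, and the function names and the toy data layout (dictionaries keyed by position) are assumptions made for illustration only. Each read is assigned to the haplotype whose phased alleles it matches most often, and reads with no informative sites or a tied vote are left unassigned.

```python
from collections import defaultdict


def assign_haplotype(read_alleles, phased_sites):
    """Return 1 or 2 for the best-matching haplotype, or 0 if undecided.

    read_alleles: {position: base observed in the read}
    phased_sites: {position: (haplotype-1 allele, haplotype-2 allele)}
    Both structures are hypothetical simplifications for this sketch.
    """
    votes = [0, 0]
    for pos, base in read_alleles.items():
        site = phased_sites.get(pos)
        if site is None:
            continue  # position is not a phased heterozygous site
        if base == site[0]:
            votes[0] += 1
        elif base == site[1]:
            votes[1] += 1
    if votes[0] > votes[1]:
        return 1
    if votes[1] > votes[0]:
        return 2
    return 0  # tie, or no informative sites covered by the read


def partition_reads(reads, phased_sites):
    """Group read names by assigned haplotype (0 = unassigned)."""
    bins = defaultdict(list)
    for name, alleles in reads.items():
        bins[assign_haplotype(alleles, phased_sites)].append(name)
    return dict(bins)


if __name__ == "__main__":
    phased = {100: ("A", "G"), 200: ("C", "T")}
    reads = {
        "r1": {100: "A", 200: "C"},
        "r2": {100: "G", 200: "T"},
        "r3": {100: "A", 200: "T"},
        "r4": {300: "A"},
    }
    print(partition_reads(reads, phased))
```

In a real pipeline each partition would then be passed to the assembler separately; how unassigned reads are handled is a design choice this sketch does not address.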

DipAsm Phased Assemblies Pipeline

To demonstrate the method’s accuracy, the researchers applied it to data from four human genomes: PGP1 from the Personal Genome Project, HG002 and NA12878 from the Genome in a Bottle (GIAB) dataset, and HG00733 from the HGSVC project. Full details of the assembly statistics, including specifics about the assemblies used for comparison with the DipAsm results, are provided in Table 1 in the paper.
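Contiguity in comparisons like these is commonly summarized as contig NG50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the estimated genome size. The sketch below shows that calculation under the standard definition; the function name and the toy numbers are illustrative and are not taken from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Return the contig NG50, or 0 if the contigs cover less than half the genome.

    contig_lengths: iterable of contig lengths in bases
    genome_size: estimated genome size in bases (NG50 uses this value,
                 whereas N50 uses the total assembly length instead)
    """
    half = genome_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy example: contigs of 50, 30, 20 and 10 bases, genome size of 200 bases.
    print(ng50([50, 30, 20, 10], 200))
```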

From the GIAB HG002 sample, the researchers generated a phased de novo assembly of 5.95 gigabases that incorporated data from both parental haplotypes. Compared to results from trio binning-based assemblies, the DipAsm assembly achieved better contiguity, and disagreed with less than 0.5% of phased heterozygous SNPs.

To evaluate the consensus accuracy of the DipAsm assembly, the researchers used dipcall to align the HG002 phased contigs against the human reference genome. Next, they called SNPs and insertions and deletions from the alignment and compared these calls to the GIAB truth dataset. Within the 2.36 Gb of confident regions in GIAB, the DipAsm assembly generated over 5,700 false SNP alleles (about 0.19% of called SNPs), and over 65,000 false insertion and deletion alleles (over 11% of called indels). It “achieves a consensus accuracy comparable to the Arrow-polished TrioCanu assembly,” according to the researchers.

Comparing the assembly to the GIAB truth data demonstrates DipAsm’s phasing power. “During assembly, failing to partition reads in heterozygous regions leads to the loss of heterozygotes,” the team explains. “On this metric, our Hi-C based assemblies only miss 0.4% of heterozygous SNPs. Those results are about 8 times better than those gleaned from a trio binning-based assembly, which is less powerful potentially because it is unable to phase a heterozygote when all individuals in a trio are heterozygous at the same site,” the researchers noted in the paper. Furthermore, “trio binning breaks short reads into k-mers, which also reduces power in comparison to mapping full-length paired-end Hi-C reads in our pipeline.”

Haplotypes through MHC Region Chart
This plot shows six haplotype-resolved contigs from three individuals through the MHC region and their high degree of divergence from the reference genome. Visit the paper to see the full figure.

In terms of long indels (>50bp), the DipAsm assembly-based call set showed over 93% sensitivity and 92% precision compared to the GIAB structural variant truth dataset. In comparison, trio binning-based call sets had about 3% lower sensitivity for indels and small variants. The researchers also identified various structural variants in the DipAsm haplotype assemblies including microsatellites, simple repeats, and short interspersed nuclear elements.
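For readers less familiar with these metrics, sensitivity and precision follow directly from counts of true positives, false positives, and false negatives when a call set is compared with a truth set. The sketch below uses made-up counts chosen only to reproduce round percentages; they are not the counts from the paper.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truth-set variants that the call set recovered (recall)."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Fraction of called variants that are present in the truth set."""
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    tp, fp, fn = 9300, 800, 700
    print(f"sensitivity = {sensitivity(tp, fn):.1%}")
    print(f"precision   = {precision(tp, fp):.1%}")
```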

Other results reported in the paper compare phased SNP calls from the DipAsm HG00733 assembly with calls from the Human Genome Structural Variation Consortium. The DipAsm assembly had a slightly lower phasing error rate and phased more heterozygous SNPs. The team also used DipAsm to assemble the NA12878 and PGP1 genomes. Those results showed that “we can achieve chromosome-long phasing albeit the shorter read length of NA12878 and the lower read coverage of PGP1,” the researchers wrote. Comparisons of these assemblies to those in the GIAB truth set indicated that DipAsm’s NA12878 assembly offers better consensus accuracy.

MDA and DNAnexus Partner to Improve Neuromuscular Patient Care and Accelerate Drug Discovery

MDA MOVR & VRP Platform

Outcomes for many people with neuromuscular disease have improved dramatically in recent years, with the launches of Biogen’s Spinraza and Novartis’ gene therapy Zolgensma, both treatments for spinal muscular atrophy, among those making recent headlines.

We are proud that our platforms can play an integral role in helping to fuel research and drug development in neuromuscular disease through a new partnership with the Muscular Dystrophy Association (MDA).

MDA has doubled down on efforts to ensure those able to directly impact the lives of people living with neuromuscular disease have state-of-the-art tools to share data, and at the core of that is a new visualization and analysis platform powered by DNAnexus. 

The neuroMuscular ObserVational Research (MOVR) Visualization and Reporting Platform (VRP) will enable 37 MDA Care Centers to analyze data from the MOVR Data Hub – MDA’s HIPAA-compliant, CDISC-formatted registry that collects longitudinal data in seven disease indications: amyotrophic lateral sclerosis, spinal muscular atrophy, Duchenne muscular dystrophy, Becker muscular dystrophy, facioscapulohumeral muscular dystrophy, limb-girdle muscular dystrophy and Pompe disease.

Extracting value from the MOVR Data Hub required a comprehensive data harmonization and ingestion strategy. DNAnexus has worked with leading organizations such as UK Biobank, City of Hope, and now MDA, using common data models such as OMOP and CDISC, as well as other data schemas, to support each customer’s unique use case. Now that the MOVR data has been ingested, researchers can unleash the power of this tremendous dataset.

Created both with clinicians and researchers in mind, the MOVR VRP features an intuitive and customizable interface, allowing different levels of analysis, from overviews of disease progression and outcomes across sites, to in-depth dives into clinical parameters across large cohorts of neuromuscular patients. 

This level of correlative analyses could ultimately stimulate new drug, biologics and gene therapy discoveries. Exploration of deeply curated neuromuscular disease cohorts in the MOVR VRP could also help in clinical trial design, by enabling clinical researchers to rapidly identify populations that meet specific clinical criteria.  

DNAnexus’ hallmark security, compliance and collaboration components will also enable increased accessibility of MOVR data to the wider neuromuscular clinical and research communities. And the platform is expected to make it easier for MDA to carry out training, fulfill data requests, review publications, encourage peer-to-peer collaboration and publishing, and share learnings across multiple channels.  

The MOVR VRP currently leverages phenotype data from the MOVR Data Hub, but it has been developed to handle a wide range of EHR and genetic data types, enabling the platform to scale up as needed. 

“The MOVR database, combined with the visualization and analysis platform from DNAnexus, allows us to make the most of this data in a way that really brings it to life to aid in developing new therapeutics.”

MDA’s President & CEO, Lynn O’Connor Vos

We are proud to serve as the technology platform bringing together MDA researchers and their partners to advance cures for this collection of neuromuscular diseases.

From Pharmacogenomics to new Benchmarking Frameworks, DNAnexus Delivers at ASHG20

ASHG 2020

Although we’ll miss the chance to spend some time in San Diego, we’re excited for the opportunity to expose our research to the wider world, as the popular annual meeting of the American Society of Human Genetics (ASHG) goes virtual this year.

From pharmacogenomics to frameworks for benchmarking, the DNAnexus research team will be presenting posters on a number of exciting topics. We’ve included them all below, for easy reference.

Are you susceptible to adverse drug reactions?

ASHG 2020 Poster 3591

Pharmacogenomics (PGx) is an embodiment of precision medicine. Yet gene-drug specific dosing guidelines are still limited. Using LightGBM, a decision-tree based gradient boosting machine learning framework, and UK Biobank phenotypic and genomic data, Chiao-Feng Lin and colleagues from the xVantage and Apollo Data Science teams investigated whether it is possible to predict an individual’s risk of adverse drug reactions, irrespective of the drug. Check out her Reviewers Choice Award poster to learn what they found, and how the Apollo Cohort Browser can make sample selection simpler.
Poster 3591

The Portable Workflow Environs: Cloud-scale workflows with nano-scale effort 

ASHG 2020 Poster 2217 Portable Workflows

As the field of bioinformatics is largely converging on open standards, such as Workflow Description Language (WDL), for the development of workflows, we have developed a suite of tools that can be leveraged for their efficient and scalable development and deployment. Join John Didion as he introduces PoWEr (The Portable Workflow Environs), an open-source ecosystem of tools to simplify and accelerate the full life-cycle of workflow development and deployment.
Poster 2217

Direct-to-risk: A scalable framework for end-to-end GWAS, fine-mapping and risk prediction on UK Biobank

Poster 3802 Direct to Risk

While polygenic risk score (PGS) analyses seem straightforward, they consist of a plethora of disparate steps, each with its own datasets, methods, and hidden assumptions. DNAnexus Apollo can be used to conduct end-to-end PGS analyses with semi-automation, scalability, traceability, interoperability, and iterability. Peter Nguyen will demonstrate how, using the framework to conduct PGS analyses for type 2 diabetes by processing phenotypic and genomic data across 500,000 UK Biobank participants and 90 million variants.
Poster 3802
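
At its core, a polygenic score is a weighted sum: each variant's effect-allele dosage (0, 1, or 2 copies) is multiplied by an effect size estimated in a GWAS, and the products are added up. The sketch below shows only that final scoring step, not the framework described in the poster; the variant IDs and weights are invented for illustration.

```python
def polygenic_score(dosages, weights):
    """Sum of effect size times effect-allele dosage over scored variants.

    dosages: {variant_id: effect-allele dosage for one individual (0 to 2)}
    weights: {variant_id: per-allele effect size, e.g. a GWAS beta}
    Variants missing from the individual's dosages are skipped.
    """
    return sum(
        beta * dosages[variant]
        for variant, beta in weights.items()
        if variant in dosages
    )


if __name__ == "__main__":
    # Hypothetical variant IDs and effect sizes, for illustration only.
    weights = {"var_a": 0.5, "var_b": -0.2, "var_c": 0.1}
    person = {"var_a": 2, "var_b": 1}
    print(polygenic_score(person, weights))
```

Skipping missing variants is one possible convention; imputing the dosage from allele frequency is another, and the choice affects how scores compare across individuals.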

A WDL-based framework for benchmarking germline variant calling pipelines for high-throughput sequencing data

Poster 2018 WDL Benchmark

Wondering which variant caller would work best for you? We developed a WDL-based framework and benchmarked six variant callers for their accuracy and runtime. Join Yih-Chii Hwang as she describes the flexible framework and how it can be further customized and applied to benchmark any bioinformatics pipeline of interest.
Poster 2018

In addition to our own research, we were proud to contribute to the following work by our customers and partners:

A new Genome in a Bottle benchmark for hard to assess and medically important genes

Poster 2009 Genome in a Bottle

We were proud to contribute to this project, whose goal was to create a high-quality benchmark variant call set for medically relevant genes, to support research and clinical applications of next-generation sequencing. Be sure to check out the Reviewers Choice Award poster by the multi-institutional Genome In a Bottle team.
Poster 2009

Automated repeat characterization of filaggrin from PacBio Sequel HiFi long reads

Poster 2046 PacBio Reads

Filaggrin (FLG) is a medically important protein-coding gene associated with atopic dermatitis, ichthyosis vulgaris, and other conditions. This poster demonstrates how to do efficient variant calling for this repeat-rich gene using a new method based on PacBio SMRT Sequencing. A DNAnexus applet was also created to help our collaborators at the Regeneron Genetics Center process thousands of samples for their research on the related diseases.
Poster 2046

Doubly Confused: Evaluation of splicing variant impact assessment with computational prediction, and vice versa

Many variants affecting splicing remain of unknown significance in the absence of definitive molecular and/or clinical data. Using data from three carefully chosen genes (HPRT, BRCA1, and ABCA4), we helped scientists at the University of Maryland assess the quality of splice variant impact assessment software. Check out the poster to see which software came out ahead.
Poster 2088

Sequencing Quality Control Phase II: Inter and intra variability of NGS and its implication on structural variation detection

Genomic structural variation (SV) includes several different classes of mutations, including deletions, insertions, translocations, duplications, inversions and complex rearrangements. The complexity of these events continues to make SV detection challenging. The challenge is amplified when data come from different sequencing technologies or centers, or from replicates within and between cohort studies. To better understand the impact of variability due to preparation, sequencing and analysis, we helped Baylor scientists compare DNA sequence from a Chinese family consisting of two identical twins and their parents, sequenced at three different centers with three replicates each as part of the Sequencing Quality Control Phase II (SEQC2) study. We utilized multiple analytical pipelines and compared the resulting 288 data sets within and across the family members. Check out the poster to see what was found.
Poster 2202