MDA and DNAnexus Partner to Improve Neuromuscular Patient Care and Accelerate Drug Discovery

MDA MOVR & VRP Platform

Outcomes for many people with neuromuscular disease have improved dramatically in recent years, with treatments for spinal muscular atrophy such as Biogen’s Spinraza and Novartis’ gene therapy Zolgensma among those making recent headlines.

We are proud that our platforms can play an integral role in helping to fuel research and drug development in neuromuscular disease through a new partnership with the Muscular Dystrophy Association (MDA).

MDA has doubled down on efforts to ensure those able to directly impact the lives of people living with neuromuscular disease have state-of-the-art tools to share data, and at the core of that is a new visualization and analysis platform powered by DNAnexus. 

The neuroMuscular ObserVational Research (MOVR) Visualization and Reporting Platform (VRP) will enable 37 MDA Care Centers to analyze data from the MOVR Data Hub – MDA’s HIPAA-compliant, CDISC-formatted registry that collects longitudinal data in seven disease indications: amyotrophic lateral sclerosis, spinal muscular atrophy, Duchenne muscular dystrophy, Becker muscular dystrophy, facioscapulohumeral muscular dystrophy, limb-girdle muscular dystrophy and Pompe disease.

Extracting value from the MOVR Data Hub required a comprehensive data harmonization and ingestion strategy. DNAnexus has worked with leading organizations such as UK Biobank, City of Hope, and now MDA, supporting common data models such as OMOP and CDISC, as well as other data schemas, to fit each customer’s unique use case. Now that the MOVR data has been ingested, researchers can unleash the power of this tremendous dataset.

Created both with clinicians and researchers in mind, the MOVR VRP features an intuitive and customizable interface, allowing different levels of analysis, from overviews of disease progression and outcomes across sites, to in-depth dives into clinical parameters across large cohorts of neuromuscular patients. 

This level of correlative analysis could ultimately stimulate new drug, biologics and gene therapy discoveries. Exploration of deeply curated neuromuscular disease cohorts in the MOVR VRP could also help in clinical trial design, by enabling clinical researchers to rapidly identify populations that meet specific clinical criteria.

DNAnexus’ hallmark security, compliance and collaboration components will also enable increased accessibility of MOVR data to the wider neuromuscular clinical and research communities. And the platform is expected to make it easier for MDA to carry out training, fulfill data requests, review publications, encourage peer-to-peer collaboration and publishing, and share learnings across multiple channels.  

The MOVR VRP currently leverages phenotype data from the MOVR Data Hub, but it has been developed to handle a wide range of EHR and genetic data types, enabling the platform to scale up as needed. 

“The MOVR database, combined with the visualization and analysis platform from DNAnexus, allows us to make the most of this data in a way that really brings it to life to aid in developing new therapeutics.”

MDA’s President & CEO, Lynn O’Connor Vos

We are proud to serve as the technology platform bringing together MDA researchers and their partners to advance cures for this collection of neuromuscular diseases.

From Pharmacogenomics to new Benchmarking Frameworks, DNAnexus Delivers at ASHG20

ASHG 2020

Although we’ll miss the chance to spend some time in San Diego, we’re excited for the opportunity to expose our research to the wider world, as the popular annual meeting of the American Society of Human Genetics (ASHG) goes virtual this year.

From pharmacogenomics to frameworks for benchmarking, the DNAnexus research team will be presenting posters on a number of exciting topics. We’ve included them all below, for easy reference.

Are you susceptible to adverse drug reactions?

ASHG 2020 Poster 3591

Pharmacogenomics (PGx) is an embodiment of precision medicine, yet gene-drug specific dosing guidelines are still limited. Using LightGBM – a decision-tree based gradient boosting machine learning framework – and UK Biobank phenotypic and genomic data, Chiao-Feng Lin and colleagues from the xVantage and Apollo Data Science teams investigated whether it is possible to predict an individual’s risk of adverse drug reactions, irrespective of the drug. Check out her Reviewers Choice Award poster to learn what they found, and how the Apollo Cohort Browser can make sample selection simpler.
Poster 3591
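
For readers less familiar with gradient boosting, the core idea can be sketched in a few lines of Python. This is an illustration only and not the team's LightGBM model: it uses one-split trees ("stumps"), a squared-error loss, and a made-up single feature with made-up outcomes, whereas LightGBM itself is far more elaborate. Each round fits a small tree to the errors left by the previous rounds and adds a damped version of it to the model.

```python
from typing import List, Tuple

# (threshold, value if x <= threshold, value otherwise)
Stump = Tuple[float, float, float]
RATE = 0.5  # learning rate: how much of each new stump is added to the model


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Find the one-split tree that minimises squared error on the residuals."""
    best_err, best_stump = float("inf"), (xs[0], 0.0, 0.0)
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if err < best_err:
            best_err, best_stump = err, (threshold, left_mean, right_mean)
    return best_stump


def predict(base: float, stumps: List[Stump], rate: float, x: float) -> float:
    """Base prediction plus the damped sum of all stump outputs."""
    return base + rate * sum(lo if x <= t else hi for t, lo, hi in stumps)


def mse(xs, ys, base, stumps, rate) -> float:
    return sum((y - predict(base, stumps, rate, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def boost(xs: List[float], ys: List[float], rounds: int = 20, rate: float = RATE):
    base = sum(ys) / len(ys)  # start from the mean outcome
    stumps: List[Stump] = []
    for _ in range(rounds):
        # Residuals are the errors the current model still makes.
        residuals = [y - predict(base, stumps, rate, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


# Hypothetical data: one numeric feature and a 0/1 outcome per individual.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
base, stumps = boost(xs, ys)
```

In this toy setting the training error shrinks with every round, which is the behaviour boosting is designed to produce. A real risk model would use a classification loss, many features, and held-out data for evaluation.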

The Portable Workflow Environs: Cloud-scale workflows with nano-scale effort 

ASHG 2020 Poster 2217 Portable Workflows

As the field of bioinformatics is largely converging on open standards, such as Workflow Description Language (WDL), for the development of workflows, we have developed a suite of tools that can be leveraged for their efficient and scalable development and deployment. Join John Didion as he introduces PoWEr (The Portable Workflow Environs), an open-source ecosystem of tools to simplify and accelerate the full life-cycle of workflow development and deployment.
Poster 2217

Direct-to-risk: A scalable framework for end-to-end GWAS, fine-mapping and risk prediction on UK Biobank

Poster 3802 Direct to Risk

While polygenic risk score (PGS) analyses seem straightforward, they consist of a plethora of disparate steps, each with its own datasets, methods, and hidden assumptions. DNAnexus Apollo can be used to conduct end-to-end PGS analyses with semi-automation, scalability, traceability, interoperability, and iterability. Peter Nguyen will demonstrate how, using the framework to conduct PGS analyses for type 2 diabetes by processing phenotypic and genomic data across 500,000 UK Biobank participants and 90 million variants.
Poster 3802
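
The final step of a PGS analysis is the simplest to illustrate: a score is a sum of allele dosages weighted by per-variant effect sizes. The sketch below is a minimal example under stated assumptions and is not the Direct-to-risk framework itself; the variant IDs, weights, and participant dosages are all hypothetical.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum effect-size-weighted allele dosages over variants present in both maps."""
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical effect sizes (for example, log odds ratios from a GWAS).
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical dosages: 0, 1 or 2 copies of the effect allele per participant.
participants = {
    "p1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "p2": {"rs_a": 0, "rs_b": 2, "rs_c": 0},
    "p3": {"rs_a": 1},  # variants missing for a participant are skipped
}

scores = {pid: polygenic_score(d, weights) for pid, d in participants.items()}
```

The hard parts that the poster addresses (running the GWAS, fine-mapping, and choosing which variants and weights to use at scale) all happen before this sum is taken.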

A WDL-based framework for benchmarking germline variant calling pipelines for high-throughput sequencing data

Poster 2018 WDL Benchmark

Wondering which variant caller would work best for you? We developed a WDL-based framework and benchmarked six variant callers for their accuracy and runtime. Join Yih-Chii Hwang as she describes the flexible framework and how it can be further customized and applied to benchmark any bioinformatics pipelines of interest. 
Poster 2018

In addition to our own research, we were proud to contribute to the following work by our customers and partners:

A new Genome in a Bottle benchmark for hard to assess and medically important genes

Poster 2009 Genome in a Bottle

We were proud to contribute to this project, whose goal was to create a high-quality benchmark variant call set for medically relevant genes to support research and clinical applications of next-generation sequencing. Be sure to check out the Reviewers Choice Award poster by the multi-institutional Genome in a Bottle team.
Poster 2009

Automated repeat characterization of filaggrin from PacBio Sequel HiFi long reads

Poster 2046 PacBio Reads

Filaggrin (FLG) is a medically important protein-coding gene associated with atopic dermatitis, ichthyosis vulgaris, and other conditions. This poster demonstrates how to perform efficient variant calling for this repeat-rich gene using a new method based on PacBio SMRT Sequencing. A DNAnexus applet was also created to help our collaborators at the Regeneron Genetics Center process thousands of samples for their research on related diseases.
Poster 2046

Doubly Confused: Evaluation of splicing variant impact assessment with computational prediction, and vice versa

Many variants affecting splicing remain of unknown significance in the absence of definitive molecular and/or clinical data. Using data from three carefully chosen genes (HPRT, BRCA1, and ABCA4), we helped scientists at the University of Maryland assess the quality of splice variant impact assessment software. Check out the poster to see which software came out ahead.
Poster 2088

Sequencing Quality Control Phase II: Inter and intra variability of NGS and its implication on structural variation detection

Genomic structural variation (SV) includes several different classes of mutations, including deletions, insertions, translocations, duplications, inversions and complex rearrangements. The complexity of these events continues to make SV detection challenging, and the challenge is amplified by variability across sequencing technologies, sequencing centers, and replicates within and between cohort studies. To better understand the impact of variability due to preparation, sequencing and analysis, we helped Baylor scientists compare DNA sequence from a Chinese family consisting of identical twins and their parents, sequenced at three different centers with three replicates each, as part of the Sequencing Quality Control Phase II (SEQC2) study. We utilized multiple analytical pipelines and compared the resulting 288 data sets within and across the family members. Check out the poster to see what was found.
Poster 2202

How to Get Reliable Variant Calling in Repeat-Rich Regions

Diploid Assembly Approach

Current sequencing technology and computational algorithms support the construction of phased diploid genome assemblies, and these are useful for studying genomic regions with high variability. For repeat-rich regions, mapping-based methods with short reads usually fail to give reliable variant calls, and it is even harder to get a phased variant callset.

A recent study, published in Nature Communications, describes efforts to develop the first variant benchmark built from a diploid assembly of the HG002 dataset from the Genome in a Bottle (GIAB) consortium, overcoming the obstacles of mapping-only approaches. Below is a summary of the findings from the recent collaboration between DNAnexus and GIAB, which builds on the work from an NCBI pangenomics hackathon hosted by UCSC in March of 2019 and on GIAB’s existing efforts to develop truth datasets that support genomic research.

The research team specifically focused on benchmarking variants in one of the most polymorphic and medically important parts of the genome – the major histocompatibility complex (MHC). The MHC plays a crucial role in adaptive and innate immunity among other activities. There is a high level of variability between individual genomes that makes characterizing this region particularly challenging. Previous benchmarks that relied primarily on short-read mapping methods are unable to map large portions of the region’s sequences because of large differences between the reads and the reference. This is particularly true for very repetitive regions of the genome that contain things like segmental duplications and tandem repeats.

Recent improvements in de novo assembly using more accurate, consensus long reads have made it possible to “represent both haplotypes without suffering from small indel errors due to error-prone long reads,” the researchers note in the paper. Previously published assemblies still had “substantial error rates for small variants of at least 10 % due to their reliance on error-prone long reads, and the individual long and ultralong read assembly incompletely resolved haplotypes.” Linked read assembly methods have also been helpful for resolving the MHC but the results are “fragmented” with similar error rates for small variants.

To generate this benchmark, the researchers developed a local de novo assembly approach that uses long-read whole-genome sequencing data. In this method, the data is partitioned into two haplotypes using ultralong nanopore sequencing reads and barcoded short reads from long DNA molecules. Specifically, the team used WhatsHap to combine long-range phasing information from these reads, and then separated the circular consensus long reads into haplotypes for the diploid assembly. 
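
The read-separation step can be pictured with a simplified sketch. It is not WhatsHap's algorithm and does not use its API; it only shows the idea of binning a read by which haplotype its alleles match at already-phased heterozygous sites. The positions, alleles, and read names below are hypothetical.

```python
from typing import Dict, Tuple


def assign_haplotype(read_alleles: Dict[int, str],
                     phased_sites: Dict[int, Tuple[str, str]]) -> int:
    """Return 1 or 2 for the better-matching haplotype, or 0 if undecided."""
    votes = [0, 0]
    for pos, allele in read_alleles.items():
        if pos not in phased_sites:
            continue  # the read covers a position with no phasing information
        hap1_allele, hap2_allele = phased_sites[pos]
        if allele == hap1_allele:
            votes[0] += 1
        elif allele == hap2_allele:
            votes[1] += 1
    if votes[0] == votes[1]:
        return 0  # no evidence, or conflicting evidence
    return 1 if votes[0] > votes[1] else 2


# Hypothetical phased heterozygous sites: position -> (haplotype 1, haplotype 2) allele.
phased_sites = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}

# Hypothetical long reads, each reduced to the alleles it shows at some positions.
reads = {
    "read1": {100: "A", 250: "C"},
    "read2": {250: "T", 400: "A"},
    "read3": {100: "A", 250: "T"},  # one vote each way
    "read4": {900: "C"},            # overlaps no phased site
}

bins = {name: assign_haplotype(alleles, phased_sites) for name, alleles in reads.items()}
```

Reads in bin 1 and bin 2 would then be assembled separately, one assembly per haplotype. Production tools also weigh base qualities and sequencing errors, which this majority vote ignores.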

Haplotype Contig Assembly
Fig. 1: Assembling a single contig for each haplotype

The next step was to “assemble both haplotypes of the MHC and use dipcall to generate benchmark variants and regions” in the GIAB sample, the researchers wrote. Specifically, they used dipcall to call variants by aligning the main haplotigs to the GRCh37 reference and using defined benchmark regions that excluded structural variants, extremely divergent regions, low quality regions, and long homopolymers.
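
Defining benchmark regions by exclusion amounts to interval subtraction: start from the candidate region and remove every excluded stretch. The sketch below shows that operation on half-open intervals. It is a generic illustration rather than the authors' pipeline, and the coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open: [start, end)


def subtract(regions: List[Interval], excluded: List[Interval]) -> List[Interval]:
    """Return the parts of `regions` not covered by any interval in `excluded`."""
    result: List[Interval] = []
    excluded = sorted(excluded)
    for start, end in sorted(regions):
        cursor = start  # everything before the cursor has been handled
        for ex_start, ex_end in excluded:
            if ex_end <= cursor:
                continue  # excluded interval lies entirely behind the cursor
            if ex_start >= end:
                break     # remaining excluded intervals lie beyond this region
            if ex_start > cursor:
                result.append((cursor, ex_start))  # keep the gap before the exclusion
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


# Hypothetical candidate region and exclusions (for example, a structural variant,
# a low-quality stretch, and a long homopolymer).
candidate = [(0, 1000)]
excluded = [(100, 200), (150, 300), (900, 1100)]
benchmark_regions = subtract(candidate, excluded)
```

Here two stretches survive, covering positions 0 to 100 and 300 to 900. Only variants that fall inside the surviving regions would count toward the benchmark.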

The final benchmark set consists of phased small and structural variants. The small variant benchmark regions cover 94 percent of the MHC and contain 49 percent – over 7,300 – more variants than were possible with previous mapping-based benchmarks. The benchmark regions include 22,368 benchmark SNVs and indels smaller than 50 bp, and cover almost all of the 23 HLA genes, with some exceptions due to the high degree of variability in specific portions of genes.
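
The figures in this paragraph can be cross-checked with a little arithmetic. The sketch assumes that the 22,368 small variants are the new total and that the 49 percent gain is measured relative to the previous mapping-based benchmark; neither assumption is stated explicitly in the text.

```python
new_total = 22_368    # benchmark SNVs and small indels reported above
relative_gain = 0.49  # "49 percent more", assumed relative to the previous benchmark

# If new = previous * (1 + gain), the previous benchmark size follows directly.
previous_total = new_total / (1 + relative_gain)
additional = new_total - previous_total
```

Under those assumptions the previous benchmark would have held roughly 15,000 variants in this region, and the gain works out to a little over 7,300, which agrees with the figure quoted.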

To check the accuracy of the benchmark set, the team compared their assembly-based diploid variant calls to 11 variant call sets generated using various methods and sequencing technologies. The results of the comparison revealed the benchmark reliably identified false positives and false negatives across methods and technologies. In addition, they found “high concordance” between the results of their assembly-based benchmark and existing mapping-based benchmarks within regions accessible to mappers. The additional variants that the assembly-based method found “are likely from those regions where HG002 has at least one haplotype that is highly diverged from the reference, making it challenging to map individual reads.”
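
Comparing a call set against a benchmark comes down to counting agreements and disagreements. The sketch below is a deliberately simplified version, assuming exact matching on position and alleles with hypothetical variants. Real comparison tools also reconcile different representations of the same variant and check genotypes, which this example does not.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[int, str, str]  # (position, reference allele, alternate allele)


def compare(benchmark: Set[Variant], query: Set[Variant]) -> Dict[str, float]:
    """Count true positives, false positives and false negatives by exact match."""
    tp = len(benchmark & query)  # called and in the benchmark
    fp = len(query - benchmark)  # called but not in the benchmark
    fn = len(benchmark - query)  # in the benchmark but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Hypothetical benchmark and query call sets, restricted to the benchmark regions.
benchmark = {(10, "A", "G"), (20, "C", "T"), (30, "G", "GA"), (40, "T", "C")}
query = {(10, "A", "G"), (20, "C", "T"), (35, "A", "C")}

stats = compare(benchmark, query)
```

In this toy case the query finds two of the four benchmark variants and adds one call the benchmark does not contain. A benchmark is considered reliable when disagreements like these turn out, on inspection, to be errors in the query rather than in the benchmark.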

“This benchmark was critical for the precisionFDA Truth Challenge V2 we held this summer, which highlighted performance of variant callers in the MHC,”

Justin Zook, Human Genetics Team Lead at NIST.

These are promising results for efforts to understand and map other highly variable parts of the genome. But there is still plenty of room for improvement. “Since most of the MHC alternate loci in GRCh38 and other MHC sequences are not fully continuous assemblies, our assembled haplotigs represent two of only a few continuously assembled MHC haplotypes,” the researchers wrote.  “We expect this curated benchmark set from a targeted diploid assembly will help the community improve variant calling methods and whole genome de novo assembly methods and form a basis for future diploid assembly-based benchmarks.”

Furthermore, with initiatives such as the Human Pangenome Reference Consortium, which aims to sequence 350 human genomes using long-read technology, “our knowledge about the whole MHC region will increase rapidly,” the researchers wrote. “A pan-genomics variant call benchmark for many individuals may become essential for economically genotyping the whole MHC region correctly.”

Meanwhile, there are other highly variable regions that would benefit from a benchmark set including the KIR and IGH loci as well as segmental duplications. Moving forward, “a combination of long read and short read technologies for resolving difficult genomic regions in many individuals will become important,” according to the team. “We hope the rich collections of diverse data sets and analyses for the GIAB samples and the future population-scale de novo sequencing will enable precision medicine from complicated genomic regions like MHC.” You can read the full paper here.