How to Train Your DRAGEN – Evaluating and Improving Edico Genome’s Rapid WGS Tools


In this blog post, we discuss Edico Genome’s DRAGEN Bio-IT Platform for rapid secondary analysis. We benchmark DRAGEN for speed and accuracy on diverse WGS datasets. Finally, we detail how Edico Genome and DNAnexus collaborated to improve the DRAGEN pipeline’s performance on noisy datasets and PCR samples in the newest version.

Introduction – DRAGEN and FPGAs

Sequencing volume continues to grow exponentially, outpacing the increase in CPU speed. As this growth strains analytical capacity, it creates demand for fast and efficient analysis approaches (and platforms like DNAnexus to coordinate them). Edico Genome develops genomic pipelines that leverage specialized Field-Programmable Gate Arrays (FPGAs) to dramatically increase analysis speed.

Each move along the spectrum of CPU -> GPU -> FPGA -> ASIC hardware trades programming/execution flexibility for speed and efficiency. Although programming them takes skill, FPGAs are reprogrammable, which allows Edico Genome to load its algorithms into compatible cloud hardware and to update this logic. DRAGEN is available on DNAnexus as a platform app.

Edico Genome’s speed has proven critical in time-sensitive applications, such as rapid diagnosis of newborns in the NICU at Rady Children’s Hospital – a mutual Edico Genome-DNAnexus customer.

Notes on Proper Training and Evaluation of Your DRAGEN

DNAnexus recently released a method to generate real, noisy NGS data called Readshift (blog)(code). Edico Genome has produced a new version of their WGS tools, labeled DRAGEN V2+, which we evaluate on Readshift.

First, we present evaluations on an HG002 benchmark dataset that was never made available to Edico Genome, to ensure that the improvements apply generally. This set of benchmarks uses 35X PCR-free WGS data with the hs37d5 reference. Evaluation is performed using the same methods as used on precisionFDA. We compare DRAGEN V2 (the prior version) to DRAGEN V2+ to demonstrate Edico Genome’s rapid improvement.

DRAGEN-Scale:  How Fast is DRAGEN?

Figure 1 compares the execution speed of DRAGEN with that of popular pipelines executed on DNAnexus. For many of these apps (e.g. GATK3), DNAnexus has applied additional optimizations to improve parallelism, meaning they are faster than they would be on local infrastructure. When time is critical, DRAGEN V2 and V2+ are the clear leaders.

DRAGEN accelerates both the mapping process and variant calling, which can be run independently. This allows users to mix-and-match if a specific variant caller is required. Upstream of the secondary analysis, Edico Genome makes an accelerated BCL2FASTQ tool which greatly improves speed and efficiency while producing identical FASTQs.

How Accurate is DRAGEN?

Figures 2 and 3 demonstrate that DRAGEN’s speed does not come at the expense of accuracy. Compared with DRAGEN V2, the newest version of DRAGEN achieves a ~40% reduction in SNP error rate and a ~50% reduction in indel error rate. Edico Genome was also recently designated as one of the winners of the precisionFDA Hidden Treasures Challenge.
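
As a rough illustration of what a relative error reduction means, here is a minimal Python sketch. It assumes that "errors" means false positives plus false negatives against the truth set, which this post does not spell out, and the counts are hypothetical numbers chosen only to make the arithmetic easy to follow. They are not DRAGEN results.

```python
def error_count(false_positives: int, false_negatives: int) -> int:
    """Total calling errors, assumed here to be false positives plus false negatives."""
    return false_positives + false_negatives


def relative_reduction(old_errors: int, new_errors: int) -> float:
    """Fraction by which the error count dropped going from the old to the new version."""
    if old_errors <= 0:
        raise ValueError("old_errors must be positive")
    return (old_errors - new_errors) / old_errors


# Hypothetical counts for illustration only (not measured values).
old_total = error_count(false_positives=6000, false_negatives=4000)
new_total = error_count(false_positives=3500, false_negatives=2500)

snp_reduction = relative_reduction(old_total, new_total)
print(f"{snp_reduction:.0%}")  # prints "40%"
```

A reduction computed this way is relative to the older version’s error count, so a 40% reduction means the newer version makes 60% as many errors on the same truth set.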

How to Train Your DRAGEN – New Improvements in DRAGEN V2+

The recent DNAnexus Readshift blog post compared pipeline performance in a number of noisy conditions. After that post, DNAnexus and Edico Genome discussed improvements based on the Readshift findings. Edico Genome rapidly iterated through several new development versions, which we built as DNAnexus Platform apps, evaluated, and discussed with Edico Genome.

These benchmarks are conducted with the same methods as on precisionFDA. Here we use the hg38 reference. For DRAGEN and GATK4 we use hg38 with ALT contigs. For other callers, we use the hg38 reference with decoy sequences, but without ALT contigs.

The labels +0.0 std, +0.5 std, etc… indicate a sample with X standard deviations worse average base quality than baseline. Think of these as progressively harder. (Full details)
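
To make the labels concrete, the sketch below computes how many baseline standard deviations a sample’s average quality sits below the baseline. It assumes the shift is measured on per-read mean base qualities, which is our reading of the description above rather than a statement of Readshift’s exact procedure. The function names and the toy quality values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def quality_shift_in_std(baseline_read_quals: Sequence[float],
                         sample_read_quals: Sequence[float]) -> float:
    """How many baseline standard deviations the sample's mean quality is below baseline."""
    mu = mean(baseline_read_quals)
    sigma = pstdev(baseline_read_quals)
    if sigma == 0:
        raise ValueError("baseline qualities have zero variance")
    return (mu - mean(sample_read_quals)) / sigma


def shift_label(shift: float) -> str:
    """Format a shift in the style used for the labels in this post."""
    return f"{shift:+.1f} std"


# Toy per-read mean base qualities (hypothetical values).
baseline = [30, 34, 30, 34]   # mean 32, standard deviation 2
harder = [29, 33, 29, 33]     # mean 31, one quality point lower on average

shift = quality_shift_in_std(baseline, harder)
print(shift_label(shift))  # prints "+0.5 std"
```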

Greater Accuracy on PCR Samples

In both the Readshift blog and an earlier DeepVariant blog, we noticed that certain HiSeqX samples caused much worse indel calling performance. We subsequently realized that these samples were the only ones that were not PCR-free. PCR is required in several cases, e.g. small DNA inputs, as well as in the new fast, automated-prep Illumina Nextera Flex kits. Strelka2’s strong results suggest that Illumina took special care to consider performance in PCR samples, and that other callers could learn from this. DRAGEN now matches Strelka2 in leading the pack for indel calling on PCR samples (indel errors are the error mode that dominates in these samples).

Improved Robustness Against Low-Quality Reads

Readshift’s main focus was to understand how calling performance degrades as the quality of a sequence run decreases. DRAGEN V2+ is not only more accurate, it is also able to resist the effect of lower quality reads up to the most extreme shift of +2.0 std.

* Discussion with Chris Saunders of Illumina indicates that a specific heuristic for SNP calling in Strelka2 may be interacting with Readshift. We are working to make a version of Readshift that will not trigger this heuristic, for a more accurate Strelka2 evaluation.

Faster Runtime on NovaSeq Samples

Readshift identified that low-quality NovaSeq data can lead to dramatically longer runtime (or program crashes). Brad Chapman has completed an excellent investigation into the use of read trimming in somatic calling that may help with this issue and more broadly.

Edico Genome has made improvements to the runtime of DRAGEN on NovaSeq samples across the board, demonstrating the ability to quickly improve for new data types.

However, the relative slowdown in low-quality NovaSeq samples remains. DNAnexus and Edico are continuing discussions on how to improve this issue.

Future Directions

Edico Genome also recently released its new DRAGEN Virtual Long Read Detection (VLRD) Pipeline (Coming Soon to DNAnexus), designed to achieve greater accuracy in segmental duplications than standard variant callers. Because DRAGEN computes so quickly, this pipeline can apply computationally intensive assembly-based techniques and jointly call all regions that are similar. We hope to investigate this method (and these difficult, but important, regions of the genome) more deeply in a future blog. Based on the responsiveness of Edico in these collaborations, we are quite convinced there will be a:

How to Train your DRAGEN 2

Edico Genome’s DRAGEN is available now as an easy-to-use app on DNAnexus. To pilot using DRAGEN V2+ on DNAnexus in your workflow, email 

Partnership with St. Jude and Microsoft – Let’s talk about it at HIMSS 2018

We’re partnering in an exciting new collaboration with St. Jude Children’s Research Hospital and Microsoft to analyze and store half a petabyte of pediatric cancer genomic data. This collaboration will accelerate discoveries and treatments to cure pediatric cancer and other rare diseases by giving researchers and clinicians the ability to collaborate globally and enabling the rapid generation and analysis of genomic data.

DNAnexus, deployed on Microsoft Azure, provides a secure and agile ecosystem in the cloud while simultaneously eliminating security, storage and speed limitations – all of which will enable St. Jude researchers to focus on complex problems on a collaborative, global scale.  

DNAnexus’ strength comes from its agile co-development process. We partner with our customers to solve new big data problems that are continuously evolving. Our team works closely with the St. Jude and Microsoft teams to determine the specific requirements and translate them into tailored solutions. From kick-off meeting to production deployment, it’s a seamless process that helps our customers and collaborators achieve their goals, no matter how ambitious.

With our secure, cloud-based infrastructure and complementary tools, researchers will be able to integrate a multitude of disparate datasets, develop their own tools, and collaborate in a secure environment that enhances data sharing and accelerates discoveries.

You can read more on how we’ve joined forces to fuel scientific discovery in a joint press release from St. Jude here. Microsoft has also written a great blog post where you can learn more about Microsoft Genomics Service and the partnership.

We’ll be at this year’s HIMSS 2018 Conference and available at Microsoft’s booth #3832 in Las Vegas, Nevada from March 5th – March 9th, as part of the larger Microsoft patient journey providing solutions in enabling more precise treatment and better patient outcomes.

Visit us at Microsoft booth #3832 and schedule a meeting with our team – email us at

Dot: An Interactive Dot Plot Viewer for Comparative Genomics

Author: Maria Nattestad, Scientific Visualization Lead

Next week, DNAnexus will be at the Plant and Animal Genome conference (PAG) in San Diego (booth 431). As part of an ongoing effort to expand our visualization capabilities, we will present an open-source tool called Dot that helps scientists visualize genome-genome alignments through a rich, interactive dot plot.

In addition to its scientific contribution, Dot encourages community development by providing a template that can be reused for new visualization tools in other areas of bioinformatics. This allows bioinformaticians to focus on the bioinformatics and visualization without needing to master web programming intricacies such as reading data from local and remote servers, which is all handled by Dot’s modular and reusable inner workings.

Importance of Dot Plots

Constructing a genome assembly is fundamental to studying the biology of a species. In recent years, advances in long-read sequencing and scaffolding technologies have led to unprecedented quality and quantity of genome assemblies. Better reference genomes contribute to better gene annotations, evolutionary understanding, and biotech opportunities.

Comparing new assemblies to existing genomes of related species is crucial to understanding differences between organisms across the tree of life. Genome assemblies are never perfect and always have to be evaluated critically by comparing against other assemblies or reference genomes, whether of the same or a closely related species. Comparative genomics is also how assemblies of two species’ genomes can be compared and contrasted to look for features that represent functional differences or inform the study of their evolution.

The classic method for visualizing genome-genome alignments is the dot plot, which provides an excellent overview of alignments from the perspective of both genomes. Dot plots place the reference genome on one axis and the query genome (which is aligned against the reference) on the other axis. Alignments between the two genomes are placed according to their coordinates on both genomes. Whereas genome browsers (such as IGV and the UCSC Genome Browser) plot data in one dimension on one genome, dot plots use two dimensions to show alignments in two genomes’ coordinate spaces simultaneously. This is necessary when representing large genome alignment data where the query coordinates matter just as much as the reference coordinates for a particular alignment.
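
The placement of alignments in two coordinate spaces can be sketched in a few lines of Python. This is an illustrative sketch, not Dot’s code: the `Alignment` type, the function names, and the sequence names and lengths are all hypothetical. Each sequence is given an offset along its axis so that multiple chromosomes or contigs sit end to end, and each alignment becomes a line segment from its start to its end in (reference, query) coordinates.

```python
from typing import Dict, NamedTuple, Tuple


class Alignment(NamedTuple):
    ref_name: str
    ref_start: int
    ref_end: int
    query_name: str
    query_start: int
    query_end: int


def cumulative_offsets(lengths: Dict[str, int]) -> Dict[str, int]:
    """Lay sequences end to end along an axis and return each one's starting offset."""
    offsets = {}
    total = 0
    for name, length in lengths.items():
        offsets[name] = total
        total += length
    return offsets


Point = Tuple[int, int]


def to_segment(aln: Alignment,
               ref_offsets: Dict[str, int],
               query_offsets: Dict[str, int]) -> Tuple[Point, Point]:
    """Convert one alignment into a line segment: x is reference, y is query."""
    x0 = ref_offsets[aln.ref_name]
    y0 = query_offsets[aln.query_name]
    return ((x0 + aln.ref_start, y0 + aln.query_start),
            (x0 + aln.ref_end, y0 + aln.query_end))


# Hypothetical sequence lengths and alignments.
ref_offsets = cumulative_offsets({"chr1": 1000, "chr2": 500})
query_offsets = cumulative_offsets({"contigA": 800, "contigB": 700})

forward = Alignment("chr2", 100, 300, "contigA", 50, 250)
reverse = Alignment("chr1", 200, 400, "contigB", 600, 400)  # query runs backwards

segments = [to_segment(a, ref_offsets, query_offsets) for a in (forward, reverse)]
print(segments)
```

In this sketch a forward-strand alignment has a positive slope, while a reverse-strand alignment (query start greater than query end) has a negative slope, which is why inversions stand out in a dot plot.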

However, dot plots have barely changed in the past decade and are still generated from the command line as static images, which limits detailed investigation. We decided to tackle this problem as an open-source science project at DNAnexus.

Introducing Dot

Here we present Dot, an interactive dot plot viewer that allows genome scientists to visualize genome-genome alignments in order to evaluate new assemblies and perform exploratory comparative genomics.

Dot supports the output of MUMmer’s nucmer aligner, the most commonly used software method for aligning genome assemblies. A quick script called converts the delta file to a more streamlined coordinates file with an index that enables Dot to read in more alignments in certain regions on demand.
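
For readers unfamiliar with the delta format, the following Python sketch shows the general idea of such a conversion. It is not the script mentioned above, and the index layout (alignment positions grouped by reference/query pair) is an assumption made for illustration. It relies on the usual layout of a nucmer delta file: two header lines, then `>ref query ref_length query_length` records, each followed by alignment header lines of seven integers and then single-integer indel offsets ending in 0. The example input is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (ref_name, ref_start, ref_end, query_name, query_start, query_end)
Record = Tuple[str, int, int, str, int, int]


def parse_delta(text: str) -> List[Record]:
    """Extract alignment coordinates from the text of a nucmer delta file."""
    records: List[Record] = []
    ref = query = ""
    # Skip the first two lines: the input file paths and the "NUCMER" marker.
    for line in text.strip().splitlines()[2:]:
        fields = line.split()
        if line.startswith(">"):
            ref, query = fields[0][1:], fields[1]
        elif len(fields) == 7:
            r_start, r_end, q_start, q_end = map(int, fields[:4])
            records.append((ref, r_start, r_end, query, q_start, q_end))
        # Single-integer lines are indel offsets inside an alignment;
        # they are not needed to draw the alignment as a line.
    return records


def index_by_pair(records: List[Record]) -> Dict[Tuple[str, str], List[int]]:
    """Hypothetical index: positions of records for each (reference, query) pair."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[(record[0], record[3])].append(position)
    return dict(index)


# A tiny made-up delta file.
example_delta = """/data/ref.fa /data/query.fa
NUCMER
>ref1 qry1 5000 4000
1 1000 1 1000 5 5 0
0
2000 3000 3000 2000 10 10 0
-25
0
>ref2 qry1 6000 4000
10 510 3500 4000 2 2 0
0
"""

coords = parse_delta(example_delta)
pair_index = index_by_pair(coords)
print(coords)
print(pair_index)
```

With an index of this kind, a viewer can look up only the alignments for the sequence pair currently on screen instead of reading the whole file.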

Interactivity and features

Dot adds a number of useful features on top of the classic dot plot concept. The index enables a quick plot of an overview that includes the longest 1000 alignments. From here, users can zoom in to look at particular regions and load all the alignments for regions of interest.
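
The overview-then-zoom idea can be sketched as follows. This is a minimal illustration rather than Dot’s implementation: the tuple layout, the function names, and the choice to measure alignment length on the reference are assumptions.

```python
import heapq
from typing import List, Tuple

# (ref_start, ref_end, query_start, query_end)
Aln = Tuple[int, int, int, int]


def ref_length(aln: Aln) -> int:
    """Length of the alignment along the reference (an assumed length measure)."""
    return abs(aln[1] - aln[0])


def overview(alignments: List[Aln], limit: int = 1000) -> List[Aln]:
    """Return the longest alignments, up to `limit`, for a fast first plot."""
    return heapq.nlargest(limit, alignments, key=ref_length)


def in_region(alignments: List[Aln], region_start: int, region_end: int) -> List[Aln]:
    """Return every alignment overlapping a reference interval, for zoomed-in views."""
    return [a for a in alignments if a[0] <= region_end and a[1] >= region_start]


# Hypothetical alignments on a single reference sequence.
all_alignments = [
    (0, 100, 0, 100),
    (200, 1200, 300, 1300),
    (1500, 1550, 90, 40),
    (5000, 7000, 4000, 6000),
]

top_two = overview(all_alignments, limit=2)
zoomed = in_region(all_alignments, 1000, 1600)
print(top_two)
print(zoomed)
```

Using `heapq.nlargest` avoids sorting the full list when only a small number of the longest alignments is needed for the overview.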

In addition to showing alignments, Dot allows scientists to load annotations for either or both genomes to show additional context  (e.g. understanding how sequence differences map to gene differences). Annotation tracks are a common feature of one-dimensional genome browsers, but to translate this concept to the two-dimensional dot plot, we enable annotation tracks on both axes. This is a major benefit of Dot that makes it possible to compare gene annotations visually alongside the alignments of the DNA sequences.

Moreover, users can jump to the same region of the reference genome in the UCSC Genome Browser to quickly see additional context for a region of interest. This allows scientists to explore how known repetitive elements in the reference genome could potentially affect assembly quality in specific regions.

Details for developers

By leveraging D3 and canvas in JavaScript, Dot combines the benefits of interactivity with scalability, enabling scientists to explore large genomes. The UI on the right side panel is built using the open-source SuperUI.js plugin, and the input handling and basic page navigation are set up through a special VisToolTemplate plugin we developed to enable others to create new visualization tools more easily. We encourage developers to utilize and build on Dot and these open-source projects to create their own visualization tools. Dot is very modular, and the template handles complex and necessary components like reading input data files from various sources, letting developers focus only on the visualization itself.

Dot is open source

Dot is free to use online at [] and open source at []. For DNAnexus users, there is a package available among the featured projects with (1) an applet for running MUMmer’s nucmer aligner that includes, (2) a shortcut  to Dot to send files from DNAnexus quickly, and (3) example data and results.