New features for managing workflows and releasing them to your global network of collaborators

DNAnexus Blog Authors

Computational genomics workflows not only accelerate R&D in the field of genomics; they are also increasingly used to make clinical diagnoses tailored to individual genomes. As DNAnexus has grown to support a large network of industries and collaborators, we have noticed that these workflows are often developed and shared across users and organizations on a global scale.

DNAnexus workflows currently provide the core functionality of allowing users to create and execute a computational workflow within a DNAnexus project. However, for users and organizations collaborating on multiple private or public projects, these project-local workflows can be difficult to maintain over the long term, particularly in larger organizations and consortia. In recognition of the need to manage workflows with a truly global network of collaborators, we are excited to introduce an additional suite of features that apply to objects we call ‘global workflows’.

As with DNAnexus applications, global workflows are published to a global space accessible by authorized users across projects. Like a GitHub or Docker repository, global workflows are versioned and updated under a globally unique name. Global workflows can be tagged, associated with broad categories (e.g. ‘read mapping’, ‘germline variant calling’, ‘somatic variant calling’, ‘tumor-normal variant calling’), and defined to run across cloud regions and providers. They can also be developed by a specified set of users and subsequently published, or released, to a larger set of authorized users who can run but not modify the workflow. Together, these features empower workflow developers to better share and advertise their workflows to a broad set of users and organizations across multiple regions and clouds.
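As a minimal sketch of what this release cycle can look like from our CLI (the workflow, user, and org names below are hypothetical; see our documentation for the authoritative command reference):

$ # Allow a collaborator to develop and build new versions (hypothetical names)
$ dx add developers globalworkflow-my_pipeline user-alice
$ # Publish a version so authorized users can run, but not modify, it
$ dx publish globalworkflow-my_pipeline/1.0.1
$ # Authorize an entire org to run the published version
$ dx add users globalworkflow-my_pipeline org-my_consortium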

A user creates a global workflow in essentially the same way as a regular workflow (see this tutorial for more details on how to create a global workflow). In fact, existing workflows on the DNAnexus platform can be converted to global workflows in a straightforward way. Since workflows written in CWL or WDL can be directly converted into workflows on our platform, these workflows can also be easily converted to global workflows. As a result, portable workflows can be imported to our platform and used in a way that meets organizational needs for access control and collaboration at scale.
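For example, building a global workflow might look like the following sketch (the directory name and workflow ID are placeholders; the tutorial linked above has the authoritative steps):

$ # Build a global workflow from a workflow source directory (dxworkflow.json inside)
$ dx build --globalworkflow my_pipeline/
$ # Or convert an existing project-local workflow directly
$ dx build --globalworkflow --from workflow-xxxx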

To illustrate the use of global workflows, we have published a public workflow available to all users of our platform. For example, from our CLI, you can run:

$ dx find globalworkflows
GATK4 Germline Best Practice FASTQ to VCF (hs38DH) (gatk4_germline_bp_fq_hs38dh),

Here, you can see that there is a GATK4 best-practice pipeline available for you to use. You can treat this workflow name like that of any other global application on the platform. Examples of how to use these features can be seen in more detail here.
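For instance, you might inspect and launch the workflow along these lines (the input names below are placeholders; dx run -h prints the actual ones):

$ # List the workflow's inputs, outputs, and stages
$ dx run globalworkflow-gatk4_germline_bp_fq_hs38dh -h
$ # Launch it, supplying inputs by name (placeholder values shown)
$ dx run globalworkflow-gatk4_germline_bp_fq_hs38dh \
      -ifastq_r1=file-xxxx -ifastq_r2=file-yyyy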

Workflow release management features were built by the Developer Experience team at DNAnexus. Thanks to the DNAnexus Science team for contributions to the design of this feature. Please see our documentation for a tutorial on how to use these features, and contact us if you have any feedback or questions.

Comparison of BGISEQ 500 to Illumina NovaSeq Data

Andrew Carroll

Last Thursday, BGI uploaded three WGS datasets of NA12878/HG001, along with a challenge to conduct a side-by-side analysis.

Albert Villela of Cambridge Epigenetix drew our attention to this dataset. Given our love of benchmarking and new technologies, we applied our evaluation frameworks to these data. The technology behind the BGISEQ-500 is based on that of Complete Genomics, which BGI acquired.

The instrument uses DNA nanoballs and a probe-based method of sequencing. We have seen data from it in PE50, PE100, and PE150 formats, with the number indicating the length of the paired-end reads (longer reads generally lead to better analysis).

A PE50 and PE100 dataset for HG001 was submitted to Genome in a Bottle last year, and DNAnexus conducted an assessment of those data. That analysis indicated a reasonable accuracy gap, particularly in indels. We also demonstrated that by re-training deep learning models on BGISEQ data (as Jason Chin did with Clairvoyante and Pi-Chuan Chang with DeepVariant), it is possible to bridge this gap using only software.

Given that this new dataset was released one year later and uses longer reads, it provides a measure of both the progress BGI has made on the instrument and the difference longer reads make. To summarize quickly: this most recent BGISEQ release demonstrates significant improvements relative to prior data, though a (modest) gap relative to Illumina remains.

Performance Comparisons

We downloaded the three WGS sets directly from EBI. All of these data were generated and submitted by BGI: two PE150 runs from the BGISEQ-500 and one run from a NovaSeq 6000, presumably operated by BGI. We analyzed each WGS set in its entirety through several standard pipelines: DeepVariant, Sentieon, Strelka2, GATK4, and FreeBayes. Mapping was performed with Sentieon, which produces output identical to BWA-MEM's but runs faster.

Because the Illumina data here were submitted by BGI, we thought it fair to also include Illumina data generated by Illumina, so we added the 35X NovaSeq WGS dataset available from BaseSpace that we previously used in our Readshift blog post.
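For readers who want to reproduce this kind of comparison, the standard approach is to evaluate each pipeline's calls against the Genome in a Bottle truth set for HG001 with a benchmarking tool such as hap.py. A sketch, rather than our exact invocation (the file names are placeholders):

$ # Compare a pipeline's calls to the GIAB truth set within the confident regions
$ hap.py HG001_truth.vcf.gz pipeline_calls.vcf.gz \
      -f HG001_confident_regions.bed -r reference.fasta -o bgiseq_vs_truth
$ # hap.py reports precision, recall, and F1 separately for SNPs and indels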

Figure 1 shows the SNP accuracy (combining false positives and false negatives). Lower bars in this chart indicate better performance. Some takeaways:

  1. The gap between Illumina NovaSeq and BGISEQ is quite narrow in these data. The difference between BGISEQ and Illumina BaseSpace data under GATK is about the same as the difference, on Illumina data, between choosing GATK and choosing DeepVariant. By this logic, if a group is comfortable with GATK's accuracy, it should be comfortable considering BGISEQ as well.
  2. The accuracy of the Illumina data available on BaseSpace is greater than that of the set uploaded by BGI. For SNPs, Illumina's BaseSpace set is more representative of what we see in the community; the BGI-submitted set appears to be a somewhat worse sequencing run. The yellow Illumina set is probably the fairer comparison.
  3. Note that the version of DeepVariant used here is NOT the one tuned with BGISEQ data (we did not run the BGISEQ-tuned model for this investigation). It would be interesting to see whether performance improves further with that model.

Figure 2 shows the indel accuracy. All of the observations made previously apply, with the addition of the following:

The error profile of the callers indicates that the Illumina dataset submitted by BGI is clearly a PCR-positive dataset: Strelka2 and DeepVariant perform much more strongly, as they do on other PCR-positive data. The impact of PCR on these data is also slightly more pronounced than in other PCR-positive datasets we see.

Presumably, all available BGISEQ preparations are PCR-positive, whereas Illumina offers both PCR-free and PCR-positive preparations. Given this, it seems fairer to compare against the yellow Illumina-operated NovaSeq set. This is also a good time to remind readers to be aware of the impact of PCR on sequencing quality and of whether their datasets were generated with it.

Breakdown of False Positives and False Negatives

When we do these benchmarks, the most frequent request is to break down the false positives and false negatives in the datasets. Figures 3 and 4 show this for SNPs and indels in one of the BGISEQ samples (which is representative of the other as well):

Figure 3. Breakdown of SNP false positives and false negatives

Figure 4. Breakdown of indel false positives and false negatives

Computational Performance

Finally, you may wonder whether any of the programs have issues running on BGISEQ data (or take longer to run). The answer is – not really. Computationally, performance is similar to what we see with Illumina data:

Figure: Core-hour comparison across pipelines for the BGISEQ and Illumina datasets

Conclusion

If this newest data is broadly representative of BGISEQ performance, the BGISEQ looks like a technology worth considering. The price points we have heard second-hand suggest buyers would be trading a less widely adopted and (slightly) less accurate platform for better economics. Based on these benchmarks, the differences in accuracy are not so large that BGISEQ genomes would be considered fundamentally different.

It is important to note that these samples are PE150; PE50 and PE100 data may perform worse. It is also important to note that these datasets were put out by BGI itself and likely represent the highest-quality runs from the instrument.

Given that it is still early in the instrument's lifecycle, it will be important to rigorously QC runs until the community has a good feel for the consistency of BGISEQ quality. If anyone else has runs of HG001/HG002/HG005 from the BGISEQ, we would love for you to reach out to us so we can replicate this analysis on community-generated runs.

Announcing the Winners of Mosaic Microbiome Community Challenge: Strains #1

The application of next-generation sequencing in the study of microbial communities has fueled the rapid growth of interest in microbiome research. However, difficulties with the accuracy of computational analyses of these complex datasets have limited the translation of microbiome science into novel biotherapeutic products. In order to unleash the potential that metagenomics holds for human health, computational methods to identify unique microbial strains must be improved.

The Mosaic Community Challenge: Strains #1, sponsored by Janssen Research & Development, LLC, through the Janssen Human Microbiome Institute, aims to benchmark and improve the performance of computational tools in analyzing these data, in order to provide better-quality profiling of microbiome samples at high resolution. The challenge gave participants the opportunity to validate their bioinformatics tools in real time on a neutral, unbiased platform and see how they performed against other industry tools.

Participants in the challenge worked with datasets composed of four different sample types: a metagenomics dataset generated from real mouse fecal samples (of known bacterial composition) and three simulated datasets of varying complexity. Besides the challenge dataset, a distinct training dataset, which included the truth files, was provided so participants could train and improve their methods. Participants could then conduct their analysis either by creating their own app on the Mosaic Platform or by downloading the dataset and running their method on their own system. Over the four-month course of the challenge, participants could take advantage of a “Testing Ground” to get immediate feedback on their work with the training datasets before submitting their final challenge entries.

Challenge Winners & Their Methods

We would like to congratulate the winners as well as thank all who participated for helping to take microbiome science to the next level.


CosmosID, a bioinformatics and NGS service laboratory, scored highest in the Profiling part of the challenge, achieving the highest cumulative F1-score, the harmonic mean of precision and recall. According to Nur Hasan, Chief Science Officer at CosmosID, the strength of their approach lies in their manually curated database, whose structure follows the phylogenetic hierarchy of all represented microorganisms and enables reliable microbial identification at all taxonomic levels, down to the strain level.
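Concretely, the score combines the two quantities as

F1 = 2 · (precision · recall) / (precision + recall)

so a hypothetical submission with precision 0.90 and recall 0.70, for example, would score 2 · 0.63 / 1.60 ≈ 0.79.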

CosmosID’s submission scored highest in the analysis of the Biological Sample (80%), roughly 64% higher than the second-place submission's score (48.9%). Interestingly, however, submissions based on the popular MetaPhlAn tool performed better across the simulated datasets. The observation that tool performance varies with the source of the sequencing data highlights the importance of benchmarking tools on both biological and simulated datasets.

Figure 1. Precision/Recall Curve for the winning submission for each of the challenge datasets (to view this chart visit the submissions page on Mosaic).

To interactively compare the Profiling submissions and view precision/recall curves, visit the Strains #1 Profiling comparison page.


Rayan Chikhi, PhD, computer scientist at the French National Center for Scientific Research (CNRS) and the CRIStAL research center, and an advisor at Clarity Genomics, scored highest in the Assembly part of the challenge, using the Minia assembler to assemble the metagenomic data provided for the challenge. The assembly portion was judged on genome fraction: the total number of aligned bases divided by the reference genome size. The winning submission also scored well across the other metrics reported in the leaderboard, namely misassemblies and mismatches.
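Concretely, for each reference strain:

genome fraction = (aligned bases covering the reference / reference genome length) × 100%

As an illustration, a hypothetical 5.0 Mb reference strain with 4.6 Mb of its sequence covered by aligned contigs would score 92%.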

Figure 2. Genome fraction scores across 13 biological sample reference strains 

Honorable mentions go to two other participants. Peter McCaffrey came a close second with his DeepBiome submission, and his submitted assemblies were longer than the winning submission's. Additionally, the submissions from Sergey Nurk (metaSPAdes assembler) consistently had the largest contigs.

To make your own comparisons between the submissions and dive deeper into the rich comparison data available, visit the Strains #1 Assembly comparison page.

Learn about the winners' methods during our webinar on Tuesday, June 26th at 10am PT (1pm ET).

Want More Ways to Participate in the Mosaic Microbiome Community?

Learn more and get involved at

Visit Us at the Microbiome Drug Development Summit!

DNAnexus will present “Translation of Microbiome Research into Clinical Applications” this Friday, June 22nd at 12pm at the Microbiome Drug Development Summit in Boston. Join our talk, and stop by our exhibition table to learn more about DNAnexus microbiome capabilities and the Mosaic Community Platform & Challenges. Email us to schedule a meeting in advance.

Translation of Microbiome Research into Clinical Applications

  • Crowdsourcing the advancement of microbiome research with the Mosaic Community platform and challenges
  • Considerations for incorporating microbiome data into clinical trials
  • Complying with GLP, 21 CFR Part 11, and more


Omar Serang, Chief Cloud Officer, DNAnexus

Michalis Hadjithomas, PhD, Microbiome Lead, DNAnexus