Bioinformaticians for Good: DNAnexus Aids in Open Science Collaborations In Fight Against Coronavirus

Sequestered in bedrooms, kitchens, and makeshift office spaces around the world, a virtual army of bioinformatic enthusiasts has banded together to do their bit in the battle against coronavirus.

These data doyens are working behind the scenes to build the tools and infrastructure scientists need to better understand and respond to the virus that is currently ravaging the world. They are helping to overcome current obstacles to research, and they are coming up with brand new solutions to a number of challenges related to drug discovery, development, and delivery.

They are gathering online for “hackathons,” which are essentially grassroots, sprint-like design events in which groups of experts (usually in small teams) compete or collaborate intensively over a number of days to solve technical challenges.

Dozens of hackathons devoted to COVID-19 are being launched worldwide, involving thousands of scientists, including many from DNAnexus.

Principal scientist Ben Busby is helping to coordinate several of the events, a role he is familiar with after helping to organize more than 40 hackathons in his former job as genomics outreach coordinator at NCBI.

Hackathons are great ways to generate new ideas, foster intense learning and facilitate data sharing and collaboration among people who would never have met otherwise, he said.

“And the DNAnexus Platform is a great data integration and collaboration solution, especially since it already has security built in,” Busby said.

COVID-19 Biohackathon

April 5-11, 2020

COVID-19 Biohackathon Logo

Hundreds of scientists and bioinformaticians have already signed up for this week-long event, organized by an international team of established biohackers. 

Using video conferencing, e-mail, Slack messaging, wikis, and source code repositories, teams will work together to address more than 20 topics.

Among them is a project to create a coronavirus “pangenome.” This would involve collecting genomic data for SARS-CoV-2 and the five other types of coronavirus known to infect humans (SARS-CoV, HKU1, NL63, OC43 and 229E), and coming up with data comparison and visualization programs in order to help researchers answer questions like: What virulence factors do SARS and CoV have in common? What proteins influence virulence, and are we likely to see a new strain emerge that is even more virulent?

hackseqRNA: COVID-19 Ultra-hackathon

April 4, May 2, May 22-24

RNAseq Hackathon Logo

In this peer-led hackathon, participants are encouraged to propose projects and organizers will help recruit a team of interdisciplinary RNA biologists, bioinformaticians, statisticians, computer scientists, and developers to turn the idea into a reality.

The hackathon entails one-day coding ‘sprints’ to develop hackseqRNA projects and lay the ground-work for research efforts, followed by a final three-day push to bring each project to completion.

Among the projects to be tackled: Using RNA sequencing data to better understand how coronaviruses react with the immune system, how different drugs work within the system, and whether personalized treatments might be developed based on individual patients’ responses.

Amplifying High Impact Potential Research for Novel Coronavirus

April 4-5

Coronavirus High Impact Research Logo

Organized by the non-profit group Research to the People and Stanford University, this hackathon involves several initiatives, from tapping into wearable devices like FitBit and Apple Watch to track infectious diseases and predict the onset of symptoms, to using computational models to simulate the effectiveness of potential coronavirus therapies.

Busby will be leading a team exploring novel genomics research directions for COVID-19, including the potential for using HLA typing to determine a person’s (or population’s) predisposition to COVID-19 severity. A kind of genetic test used to identify certain variations in a person’s immune system, HLA (short for human leukocyte antigen) typing is currently used to identify which people can safely donate bone marrow, cord blood, or an organ to a person who needs a transplant.

Busby’s team will investigate any correlations between HLA types and an individual’s susceptibility to coronavirus, as well as the severity of their reactions when infected. They may also look at potential epidemiological ramifications based on geographic distributions of HLA types. This research will help to lay the groundwork for the HLA COVID-19 Project.

Seeding New Science

Although the projects may not result in complete, polished solutions, they will help seed new science, which can be just as valuable, Busby said.

“This could be a really good opportunity to think about precision medicine in infectious diseases, something that’s been largely ignored before now,” he added. “Treatments would be way more effective if you knew your immune type. It could also help in the assessment of personal susceptibility and risk.”

Busby also notes that even if the current COVID-19 pandemic subsidies, the virus — or others like it — will likely re-emerge.

“Having responses ready when it does re-emerge would be very helpful, but we have to start now,” he said. “Our contribution to the ‘all hands on deck’ coronavirus call is to build an infrastructure so that we will be able to respond in a much better, far more constructive, way.”

Our Commitment to our Customers and Partners During Coronavirus (COVID-19)

Coronavirus Statement

At DNAnexus, customer success is one of our core values and we take pride in delivering a high level of quality and reliability. With the current situation, we wanted to share with you the steps we have taken for the coronavirus (COVID-19) outbreak.

Today, we have no impact nor do we foresee any impacts to the delivery of the DNAnexus Platform and Services. Our cloud-based, multi-tenant architecture and fault-tolerant approach is designed to operate without any service disruptions. Moreover, we have a distributed, highly-skilled workforce committed to supporting the customers and partners around the globe.

Many of you are examining your own preparations to ensure continuation of your work during COVID-19. In the spirit of transparency, we are sharing our plans to ensure the health and safety of members of the DNAnexus community, including our employees, customers and partners.

Travel and Remote Work

To limit the risk of exposure and spread of COVID-19, we have restricted travel for our employees and contractors. We are committed to provide the highest levels of service and engagement through online means and telecommunications.

In addition, we have closed our offices and our employees and contractors are working remotely from home. As a global, cloud-based company, DNAnexus people frequently work from anywhere, and have the resources and tools they need to perform their jobs securely from any location.

Service and Operations

Our cloud service is designed with a high degree of redundancy and fail-over capabilities to reduce the likelihood of significant impact due to an emergency or crisis. We have developed, maintained, and trained our personnel in the policies and procedures for responding to these events, including disaster recovery plans that are part of our FedRAMP Moderate Authority-To-Operate (ATO), our ISO27001 certification, and our HITRUST program.

We continue to closely monitor and consider recommendations from the World Health Organization (WHO), the Centers for Disease Control (CDC), and government and health officials where our employees live and work.

We will stay in touch to keep you updated. Thank you for our continued trust in DNAnexus. If you have any questions or concerns, please do not hesitate to contact us at support@dnanexus.com.

Regards,

Richard Daly
CEO
DNAnexus

BWA-MEM2 Review: Should You Upgrade?

BWA-MEM2 — the Intel accelerated version of BWA-MEM

The Burrow-Wheeler Aligner (BWA-MEM), which requires no introduction, is one of the most popular software tools in the Bioinformatics and Genomics industry. Being the first step short-reads undergo after generated by a sequencing instrument, BWA-MEM has been widely used as a common upstream tool. BWA-MEM generates alignments to a reference genome for a variety of germline and somatic genetic variant detections, such as single-nucleotide polymorphisms (SNPs), indels, structural variations, and copy number variations. With the widespread use of next-generation sequencing technologies, more and more sequencing reads are now being generated for translational research and clinical diagnostics. Scientists need to be able to process these reads more efficiently and cost effectively.

BWA-MEM2, which was recently announced at ipdps2019 (2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)) and published[1] by Vasimuddin Md (@wasim_galaxy), Sanchit Misra (@sanchit_misra), Heng Li (@lh3lh3), and Srinivas Aluru from the Intel Corp., Dana-Farber Cancer Institute and Georgia Institute of Technology. The team optimized algorithms of BWA-MEM for three key kernels: (1) Search for super maxima exact matches (SMEM), (2) Suffix array lookup (SAL), and (3) banded Smith Waterman (BSW) algorithm. Part of the optimizations utilized the AVX-512 instruction, which is for the Intel® Xeon® Skylake processor and an Intel® Xeon® E5 v3 processor, while the other optimizations are generic and can provide performance gains on any of the modern processors.

In this blog post, we would like to share our experiences on performing BWA-MEM2 with the next-generation instance types (c5d) that use the Intel Xeon Scalable Processors (Cascade Lake) or Intel Xeon Platinum 8000 series (Skylake-SP), and utilize the AVX-512 instruction set on a cloud computing environment. We also are demonstrating the comparisons between running BWA-MEM and BWA-MEM2.

Mapping with BWA-MEM vs. BWA-MEM2

To represent reads from a range of different sequencing instruments, library prep, and sequencing labs, we downloaded paired-end FASTQ files from Sequence Read Archive (SRA) by randomly drawing ten samples from each of the Illumina HiSeq 2500, HiSeq 4000, HiSeq X Ten, and NovaSeq 6000. Five of the samples were from whole exome sequencing (WES) library prep and another five were from whole genome sequencing (WGS) library prep, according to their SRA records. All of the samples were submitted to SRA from January to November 2019. 

Five out of the 40 samples we drew didn’t get resolved into paired-end FASTQ files. For simplicity, we discarded the five to benchmark alignment without paired-end reads, which is the most common use case at DNAnexus. The remaining 35 paired-end samples consist of 307 million to 1.59 billion reads. All samples have read lengths distributed from 100bp to 151bp, except one sample (SRR10028120) has its shortest read down to 18bp. Figure 1 shows the distribution of number of reads across samples. 

BWA-MEM2 Distribution Sample Size by Reads
Figure 1. Distribution of size of each sample by number of reads.

We mapped reads against hs38DH which consists of primary contigs + ALT contigs + decoy contigs and HLA genes from the GRCh38 assembly. For performing BWA-MEM, we used the BWA-MEM FASTQ Read Mapper app, which runs BWA-MEM v.0.7.15 while piping the mapped reads into biobambam2 version 2.0.87 for coordinate sorting and marking duplicates. For BWA-MEM2, we used the same pipeline configuration but swapped the BWA-MEM binary to BWA-MEM2 binary built from 1038fe3. Each sample was processed on a single AWS c5d.9xlarge instance utilizing full 36 vCPUs, and the I/O was managed on the DNAnexus Platform.

Commands we used for running BWA-MEM and BWA-MEM2:

# BWA-MEM:
bwa mem -t 36 \
  /path/to/hs38DH.fa \
 /path/to/SAMPLE_R1.fastq.gz \
 /path/to/SAMPLE_R2.fastq.gz \
 -K 100000000 -Y \
 -R '<SAMPLE Read Group INFO>' \
 | bamsormadup SO=coordinate threads=36 level=6 inputformat=sam indexfilename=SAMPLE_NAME.bam.bai M=SAMPLE_NAME.duplication_metrics

# BWA-MEM2:
bwa-mem2 mem -t 36 \
 /path/to/hs38DH.fa \
 /path/to/SAMPLE_R1.fastq.gz \
 /path/to/SAMPLE_R2.fastq.gz \
 -K 100000000 -Y \
 -R '<SAMPLE Read Group INFO>' \
 | bamsormadup SO=coordinate threads=36 level=6 inputformat=sam indexfilename=SAMPLE_NAME.bam.bai M=SAMPLE_NAME.duplication_metrics

Findings

BWA-MEM2 is actively WIP (work in progress) 

With the 35 samples we processed, three of them failed in BWA-MEM2, while they all were mapped smoothly with BWA-MEM. We have reported these events to the bwa-mem2 github page (35, 49, and 50). They were all addressed by Intel’s Parallel Computing Lab prior to this blog posting.

BWA-MEM2 is ~2X faster

In Figure 2, we collected the walk-clock runtime of alignment for all 32 samples that mapped successfully using both BWA-MEM and BWA-MEM2 in multi-threading mode (36 threads). The runtime includes pipeline setup, mapping, sorting, markdup, as well as I/O between a c5d.9xlarge instance and an s3 bucket. Note that we deducted time spent for transferring the reference genome index onto the instance, because the significant larger size of the reference genome index file for BWA-MEM2 (30.27 GiB) comparing to 3.32GiB for BWA-MEM, which roughly took 8 mins and 48 seconds per sample run, respectively.

Not surprisingly, we found the runtime scales linearly with the number of reads for both BWA-MEM and BWA-MEM2. On average, BWA-MEM2 runs 2.07 times faster than BWA-MEM, while the standard deviation of the runtime ratios is 0.66. This trend is similar to what was found in BWA-MEM2’s performance report (link).  With the ~2X acceleration, BWA-MEM2 not only significantly reduces the turnaround time from reads off the sequencing instrument to clinical diagnosis results, but also provides a more cost-effective solution for read alignment.

BWA-MEM2 Run time by Number of Reads
Figure 2. Wall clock runtime vs. number of reads in BWA-MEM and BWA-MEM2.

Discrepancies in BWA-MEM and BWA-MEM2

We compared the mapping results from the two tools to validate if BWA-MEM2 is reproducing the identical mappings as BWA-MEM and to evaluate if there are any discrepancies that could impact downstream genetics variants detections.

We used BamUtil to compare the mapped, sorted and duplicate marked BAMs from BWA-MEM and BWA-MEM2. We counted a read as being reported differently when a read was mapped to different positions, or having different tag values in MD:Z (string for mismatching positions), NM:i (edit distance to the reference), MQ:i (mapping quality of the mate/next segment), RG:Z (readgroup), XA:Z (alternative hits), or XS:i (Suboptimal alignment score).

Two (SRR10270814 and SRR10270774) out of the 32 samples were found to have large amounts (89% and 93%) of discrepancies between the BAMs from BWA-MEM and BWA-MEM2. We realized that the two samples were sequence reads of vancomycin-resistant Enterococcus faecium (VREfm) isolated from stool and blood cultures of patients undergoing chemotherapy or hematopoietic stem cell transplantation, despite the fact we limited our sampling search for only Homo sapiens organisms. We removed these two samples from further comparisons. For the remaining 30 samples, on average, 7.11% of the mapped reads are reported differently in BWA-MEM and BWA-MEM2. Figure 3 shows the percentage  differences of reads reported per sample in BWA-MEM.

BWA-MEM2 Percentage of Reads Mapped
Figure 3. Percentage of reads mapped differently between BWA-MEM and BWA-MEM2.

Where are the differences from? ALT contigs.

For all samples, around half of the differences are either exclusively classified by BWA-MEM or BWA-MEM2. The other half of differences were classified by BWA-MEM reporting all alternative mapping locations (while BWA-MEM2 does not report any of those locations), mate-mapping quality score, suboptimal alignment score, mapping string, CIGAR string, and others (Figure 4).

BWA-MEM2 Classifications of Discrepancy
Figure 4. Classifications of discrepancy between BWA-MEM and BWA-MEM2.

We were surprised about the high number of differences because concordance is crucial for all downstream variant calling analysis. We consulted this discovery with Heng Li, and followed up by repeating the same comparison for the three samples (SRR10150660, SRR10286881, and SRR10286930) that differ the most — this time we mapped the reads against GCA_000001405.15_GRCh38_no_alt_analysis_set, which consists GRCh38 primary contigs and human decoy region (hs38d1), but no ALT contigs nor HLA genes. The differences between these three samples mapped by BWA-MEM and BWA-MEM2 dropped from 11%–18% down to all being less than 0.001%, which probably has a negligible effect. We suspected that mapping to ALT contigs and HLA genes may trigger some corner cases which were not yet properly implemented in BWA-MEM2.

While variant calling for reads mapped to ALT contigs is not yet widely taken for downstream software tools, we look forward to the equivalent ALT contig and HLA-gene alignments in BWA-MEM2 to BWA-MEM.

Final thoughts & discussions

The command line interface of BWA-MEM2 is the same as BWA-MEM — we used the exact same syntax for specifying inputs (reads and the reference genome) and setting parameters. Additionally, the output SAM format is compatible with downstream software tools (i.e. samtools and bamsormadup). In addition, BWA-MEM2 seamlessly determines and runs the most efficient mode among AVX512, AVX2, SSE4.1, or Scalar, depending on what is the fastest mode possible with the instruction set architecture of the instance.

BWA-MEM2 requires a larger reference genome index file. Take hs38DH as an example, it takes up to 30.27GiB storage space in gzipped tarball (tar.gz) format. This is almost one order of magnitude bigger than the same genome index for BWA-MEM (3.32GiB). In some settings for the cloud computing environment, this can cause additional runtime overhead to set up the alignment environment, such as moving the genome index to the instance and extracting the tarball. We also ran into cases while a sample requires more than the memory limit of the c5d.9xlarge instance’s memory capacity (72GiB) and ran out of memory. During our discussion with the Intel team via Github, we learned that we could lower the threads in use with the cost of slower mapping. We look forward to possible future BWA-MEM2 updates to reduce the memory footprints for certain scenarios.

Should you upgrade? BWA-MEM2 shows significant advancement with its 2X speedup, compatible I/O interface to BWA-MEM, seamless support for different chip instructions, and equivalent mapping results as BWA-MEM. While it is provided as “not recommended for production uses at the moment” by the Intel team, we expect to see these short-read alignment upgrades to BWA-MEM2 in the foreseeable future.

Acknowledgements

We’d like to thank Heng Li for his valuable feedback of the BWA-MEM and BWA-MEM2 comparison work and insights about reference genome builds. In addition, we’d like to thank Brett Hannigan (@gildor) for the initial design and discussion of this work.

References

[1] M. Vasimuddin, S. Misra, H. Li and S. Aluru, “Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems,” 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 2019, pp. 314-324.