Providing Bioinformatics Solutions to Address Challenges with Structural Variants

Contributors:  Arkarachai Fungtammasan, Jason Chin, Gigon Bae, Fernanda Foertter, Fritz Sedlazeck, Claudia Fonseca

“Houston, we’ve had a hackathon.”  

And this hackathon has yielded four creative bioinformatics solutions to address the complexities of structural variants.

Recently, Nvidia and DNAnexus jointly sponsored the NCBI Structural Variant Hackathon at Baylor College of Medicine on October 11-13. The event was attended by 45 participants from Baylor College of Medicine, UT Southwestern, Rice University, Stanford, and the Broad Institute. Some guests even traveled all the way from Qatar to attend. 

What is a Structural Variant?

A structural variant is any segment of DNA longer than 50 base pairs that has been rearranged in some fashion: inserted, deleted, duplicated, inverted, or translocated [1]. Structural variants contribute to many diseases, including cancer. Yet our understanding of them lags behind that of single nucleotide variants because they are difficult to identify, particularly in short-read sequencing data.
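
To make the size cutoff concrete, here is a minimal Python sketch that labels a simple REF/ALT allele pair by how much the allele length changes. The function name, the labels, and the use of plain REF/ALT strings are assumptions made for illustration; real callers also handle duplications, inversions, and translocations, which cannot be recognized from allele length alone.

```python
# Threshold taken from the definition above: segments longer than 50 bp.
SV_MIN_LENGTH = 50


def classify_variant(ref: str, alt: str) -> str:
    """Label a REF/ALT pair by the size of its length change (illustrative only)."""
    size = abs(len(alt) - len(ref))
    if size > SV_MIN_LENGTH:
        # Only insertions and deletions are distinguishable from allele lengths.
        return "structural insertion" if len(alt) > len(ref) else "structural deletion"
    if size == 0:
        return "SNV" if len(ref) == 1 else "substitution"
    return "small indel"


if __name__ == "__main__":
    print(classify_variant("A", "G"))             # single-base change
    print(classify_variant("A", "A" + "T" * 60))  # 60 bp inserted
```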

Leading the Charge.

Ben Busby, Scientific Lead at NCBI, and Fritz Sedlazeck, Assistant Professor from Baylor College of Medicine, led the hackathon as a means to encourage inter-institutional collaboration and thinking to tackle research questions related to structural variants of the genome. 

NCBI Structural Variant Hackathon
Attendees listen intently at the recent NCBI Structural Variant Hackathon.

DNAnexus provided cloud computing credits during the hackathon, and both DNAnexus and Nvidia sent scientists to support attendees with cloud computing, GPU-accelerated computing, and bioinformatics pipeline construction. Participants had the opportunity to learn how to build workflows using the DNAnexus Platform graphical user interface or from the command line using Workflow Description Language (WDL). They could also build reproducible prototypes using Jupyter notebooks, a collaborative framework for working in the cloud environment. In addition, they were able to learn how to use graphics processing units, or GPUs, to improve the efficiency of bioinformatics workflows. Incidentally, GPUs were originally designed to support high-quality gaming experiences, but they are now being harnessed for computationally intensive workloads such as physics simulation and deep learning.

These hackathons are important events that bring together people from different fields and different stages of their careers for three intense days to network, collaborate, and tackle important bioinformatics challenges.

FRITZ SEDLAZECK, PHD, ASSISTANT PROFESSOR, HUMAN GENOME SEQUENCING CENTER

The event also included an inspirational talk from Richard Gibbs, Director, Baylor College of Medicine Human Genome Sequencing Center, on how the hacking mindset is actively transforming our understanding of genomics. From the mapping of the first human genome to the current era of precision medicine, many great scientific ideas have originated from hacking.

Richard Gibbs Structural Variant Hackathon
Richard Gibbs presenting to the hackathon attendees. He presented in the same room in which many of the meetings for the human reference genome construction took place.
Structural Variant Hackathon HGSC
Hackathon groups visited the Human Genome Sequencing Center. There were many DNA sequencers from a variety of companies such as Illumina, Pacific Biosciences, and Thermo Fisher Scientific.

The 45 participants split into groups and each group went to work brainstorming ideas that they could work on over the next three days. Ideas were pitched to the larger group and refined based on feedback. 

The next three days were devoted to implementing each of the ideas, and this was when the room came alive. With the DNAnexus and Nvidia teams helping groups get started, there was a lot of cross-talk and collaboration among the attendees, many of whom merged ideas and borrowed from one another’s prototypes. According to Claudia M.B. Carvalho Fonseca, PhD, one of the attendees, “It was fascinating to see the synergy between people from different disciplines (computational biology, bioinformatics, molecular biology, etc.) to work toward common goals.” She added: “The combination of good organization, time constraints, and diverse backgrounds boosted creativity and helped each group develop solutions.”

The hackathon yielded the following innovations.

Fast and efficient QC for multi-sample VCF.

This Python package can perform a rapid evaluation of 2500 sample VCFs in one and a half minutes. Find the package here.
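
As a rough illustration of what a multi-sample VCF quality check can involve, the sketch below computes the fraction of missing genotype calls per sample using only the Python standard library. It is not the hackathon package: the function name and the choice of missing-call rate as the QC metric are assumptions for illustration, and it expects uncompressed, tab-separated VCF lines in which GT is the first FORMAT field.

```python
from collections import Counter


def per_sample_missing_rate(vcf_lines):
    """Return {sample: fraction of records with a missing genotype}."""
    samples = []
    missing = Counter()
    n_records = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            continue
        n_records += 1
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[0]  # assumes GT is the first FORMAT key
            if "." in genotype:
                missing[sample] += 1
    return {s: (missing[s] / n_records if n_records else 0.0) for s in samples}


if __name__ == "__main__":
    toy_vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
        "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t./.",
        "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/0",
    ]
    print(per_sample_missing_rate(toy_vcf))
```

A streaming, line-by-line pass like this keeps memory use flat regardless of the number of records, which matters when a file holds thousands of samples.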

Bioinformatics Presentations

We then made the final presentation to the broader community. Here are some highlights.

Genome mis-assembly detection using structural variant calling.

This quality control tool for metagenomic assemblies uses dxWDL, a Workflow Description Language (WDL) compiler for the DNAnexus Platform, and Docker to build portable workflows that run on the DNAnexus Platform.

Fast structural variant graph analysis on GPUs.

This applet, called super-minityper, uses a set of cloud-based workflows for constructing structural variant graphs and mapping reads to them. The super-minityper is implemented as a DNAnexus cloud workflow/applet using dxWDL. For the minimap2 + seqwish pipeline, the super-minityper also provides a WDL file in which minimap2 is replaced by cudamapper from NVIDIA’s Clara Genomics Analysis SDK for faster, GPU-accelerated analysis. It also provides a public Docker image (ncbicodeathons/superminityper:dx-wdl-builder-1.0) that makes the DNAnexus dxWDL compiler easy to use.

Note: The DNAnexus Platform currently doesn’t support GPU-enabled virtual machines for workflows from a web UI but this support is planned for a future release.

DeNovoSV.

This pipeline identifies and validates de novo structural variants in genomics datasets from trios.
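
The core idea behind de novo calling in a trio is that a candidate variant appears in the child but in neither parent. The sketch below shows that filtering step only, under stated assumptions: the `SV` record, the function names, and the 50 bp breakpoint tolerance are hypothetical choices for illustration and are not taken from DeNovoSV, which also validates its candidates.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal structural variant record."""
    chrom: str
    start: int
    end: int
    svtype: str


def same_event(a: SV, b: SV, tol: int = 50) -> bool:
    """Treat two calls as the same event if type matches and breakpoints are within tol bp."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def de_novo_candidates(child: List[SV], mother: List[SV], father: List[SV],
                       tol: int = 50) -> List[SV]:
    """Return child calls that match no call in either parent."""
    parents = list(mother) + list(father)
    return [sv for sv in child if not any(same_event(sv, p, tol) for p in parents)]


if __name__ == "__main__":
    child = [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 500, 900, "DUP")]
    mother = [SV("chr1", 1010, 1990, "DEL")]
    father = []
    print(de_novo_candidates(child, mother, father))
```

Matching with a tolerance rather than exact coordinates reflects the fact that breakpoints for the same event are often reported a few bases apart in different samples.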

SWIft Genomes in a graph.

This automated pipeline builds graphs quickly using a k-mer approach. Generally, building graphs for genomes or large genomic regions is computationally expensive; however, with a multi-scale approach, this pipeline employs a simple algorithm and tool to build genome graphs for the human major histocompatibility complex (MHC) region within three minutes.
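
For readers unfamiliar with k-mer graphs, the following is a generic sketch of the idea, not the pipeline's actual algorithm: each k-mer becomes a node, and an edge links k-mers that occur consecutively in any input sequence. Nodes with more than one successor mark places where the sequences diverge. The function names and toy sequences are assumptions for illustration.

```python
from collections import defaultdict


def kmer_graph(sequences, k):
    """Build {kmer: set of k-mers that directly follow it in some sequence}."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    edges = defaultdict(set)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return dict(edges)


def branch_points(graph):
    """Return the sorted k-mers that have more than one successor."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    graph = kmer_graph(["ACGTA", "ACGGA"], k=3)
    print(graph)
    print(branch_points(graph))
```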

The spirit of innovation continued after the hackathon, when a group of attendees visited the Space Center in Houston. There, attendees saw the Saturn V rocket, the same model that carried the Apollo 11 astronauts to the moon.

Bioinformatics Hackathon Saturn V
Hackathon attendees in front of the Saturn V rocket.

The next structural variant hackathon at Baylor College of Medicine will take place on April 19th-21st. For more information or to register, visit:  https://www.hgsc.bcm.edu/events/hackathon

Works Cited
[1] https://www.jax.org/news-and-insights/2018/april/calling-all-structural-variants

Addressing the Complex Storage and Archival Needs of DNA Sequencing Data

Michael Schatz, a computational biologist and expert in large-scale DNA analysis, estimates that by the year 2025 we will amass between 100 million and 2 billion sequenced human genomes [1]. That’s a massive amount of data to use for the purpose of improving human health, but there’s a bit of a catch: we need to find creative solutions for storing these data. These solutions must be economical and must promote rapid retrieval of data when needed for analyses.

Typically, cloud providers, such as Amazon Web Services (AWS) and Microsoft Azure, use a tiered pricing structure. Data that are accessed frequently command a higher storage fee than data that are accessed infrequently and placed in cloud archives, or cold storage.

At DNAnexus, we are committed to supporting the sequencing community with creative solutions, which is why we are proud to announce the upcoming rollout of our new archival service.

The DNAnexus archival service provides a cost-effective and secure way to store files that do not need to be accessed frequently. More importantly, even though the files may be moved to cold storage, the DNAnexus Platform continuously maintains the data provenance and keeps the meta-information of those files, such as tags and property key-value pairs, searchable. With the DNAnexus archival service, you can locate files, whether they are archived or live, simply by querying their meta-information.

With this feature, you can archive individual files, folders, or entire projects. You can also easily unarchive one or more files, folders, or projects when you need to make the data available for further analyses.

Currently, the DNAnexus archival service is available via the application programming interface (API) in AWS regions only, and you must have a license.

  • To learn more about archiving and unarchiving, click here.
  • To request a license to use this feature, contact sales@dnanexus.com

1. Fleishman G. The Data Storage Demands of Genome Sequencing Will Be Enormous. MIT Technology Review. https://www.technologyreview.com/s/542806/how-do-genome-sequencing-centers-store-such-huge-amounts-of-data/. Published October 26, 2015. Accessed October 3, 2019.

Scaling to Meet the Demands of Rare Disease Genetics

Approximately 350 million people worldwide face the life-changing impacts of a rare disease. The sheer number of people affected raises the question: are rare diseases really that rare? According to Jussi Paananen, Chief Technology Officer at Blueprint Genetics, perhaps the only thing rare about them is how difficult they are to diagnose: “it’s like finding a needle in a haystack.”

It’s like finding a needle in a haystack.

Jussi Paananen, CTO Blueprint Genetics

Blueprint Genetics wants the process of diagnosing and understanding rare diseases to be more efficient and less challenging for patients. The global clinical diagnostics company focuses on rare disease genetics and offers single gene tests, panels, and whole exome sequencing in 14 medical specialties, including cardiology, ophthalmology, immunology, hereditary cancer, and malformations, to name a few. The result of Blueprint’s end-to-end testing operation is an easy-to-understand clinical report that physicians can share with their patients.

There’s been a notable surge in rare disease activity over the last few years, driven by reduced sequencing costs and several public sector initiatives that have increased our understanding. Blueprint experienced this surge firsthand and had to scale its operations rapidly to meet demand. When Paananen’s team added an Illumina NovaSeq™ 6000 system to its operations, the volume of samples increased, and so did the amount of data for each sample. By partnering with DNAnexus and moving its bioinformatics pipelines to the cloud, Blueprint Genetics was able to scale in both velocity and volume.

After samples are sequenced on the NovaSeq system, the raw data now run through Blueprint’s proprietary bioinformatics pipelines on DNAnexus, which produce pre-processed variants that the in-house geneticists interpret and use as the basis for final clinical reports.

Blueprint Genetics: Scaling & Increasing Demand with DNAnexus

According to Paananen, Blueprint has scaled to meet the increased demand, so sequencing and bioinformatics are no longer the bottleneck; interpretation is. But Blueprint has a solution for this challenge as well: augmented intelligence. Augmented intelligence, in which human expertise remains at the center of decision making with help from computer counterparts, will enable geneticists at Blueprint to automate as much as possible to increase interpretation speed and quality.

The bottleneck in scaling clinical genetics is no longer in sequencing or bioinformatics, but in clinical interpretation of the data.

To learn more about how Blueprint Genetics has scaled its high-volume clinical diagnostics business, watch the video below:

Works Cited

1. Austin CP, Cutillo CM, Lau LPL, et al. Future of Rare Diseases Research 2017-2027: An IRDiRC Perspective. Clinical and Translational Science. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5759721/. Published January 2018. Accessed October 1, 2019.