Genome-wide association studies (GWAS) give researchers a viable approach for identifying genetic variants associated with a particular trait. GWAS have already identified single nucleotide polymorphisms associated with diabetes, Parkinson’s disease, and other conditions. However, these comprehensive studies frequently identify large numbers of genetic variants associated with a phenotype, not all of which are causal.
Fine mapping, a statistical process in which additional data are introduced into the GWAS dataset, enables researchers to prioritize the variants that warrant further examination. It also helps them identify variants that narrowly missed the genome-wide significance threshold but actually are causal.
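To make the statistics concrete, here is a minimal sketch of one widely used Bayesian fine-mapping calculation: approximate Bayes factors in the style of Wakefield, combined into posterior inclusion probabilities under a single-causal-variant assumption. The effect sizes and the prior variance w below are illustrative assumptions, not values from any particular study.

```python
import numpy as np

def approx_bayes_factor(beta, se, w=0.04):
    """Approximate Bayes factor in favor of a nonzero effect
    (the reciprocal of Wakefield's null-based formulation).

    beta, se: GWAS effect-size estimate and its standard error.
    w: prior variance of the true effect; 0.04 (= 0.2**2) is a common default.
    """
    v = se ** 2
    z2 = (beta / se) ** 2
    return np.sqrt(v / (v + w)) * np.exp((z2 / 2) * (w / (v + w)))

def posterior_inclusion_probs(betas, ses, w=0.04):
    """PIPs for a region, assuming one causal variant and a uniform prior."""
    abfs = np.array([approx_bayes_factor(b, s, w) for b, s in zip(betas, ses)])
    return abfs / abfs.sum()

# Toy region with three variants; the middle one carries the strongest signal.
print(posterior_inclusion_probs(betas=[0.02, 0.15, 0.05], ses=[0.01, 0.02, 0.01]))
```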
But fine mapping is easier said than done. For starters, you have to set up the proper computing environment, one that promotes traceability and reproducibility. Traceability and reproducibility become even more important when you are testing a drug that may eventually enter clinical trials. You also need to assemble the data in the way your fine-mapping algorithm expects, which can be challenging. And then there are the scientific challenges: it’s hard to compare and evaluate models, and there are no frameworks that let you interact with the models and improve upon them.
The DNAnexus Platform provides end-to-end support for machine learning and enables you to build and deploy models so that domain scientists can ask questions and interact with the models themselves.
Join us for our upcoming webinar in which we provide an overview of how to refine your GWAS results using fine mapping. Specifically, by borrowing from Bayesian statistical methods, we present an interactive approach for applying machine learning-based models in fine mapping. Real-life examples will be demonstrated using UK Biobank data on the DNAnexus Platform. Register now.
When genetic tests are ordered, little thought is probably given to all of the bioinformatics work required to make the test possible. However, the bioinformatics team at Myriad Genetics understands firsthand just how much work it takes. Myriad Genetics provides diagnostic tests that help physicians understand risk profiles, diagnose medical conditions, and inform treatment decisions. To support its comprehensive test menu and its commitment to timely and accurate test results, the bioinformatics team at Myriad focuses on optimizing its bioinformatics pipelines. How? By designing pipelines that leverage modularity and computational reuse, so the team can make improvements and iterate more quickly.
Jeffrey Tratner, Director of Software Engineering, Bioinformatics, at Myriad, spoke at DNAnexus Connect, explaining how fast iteration works on the DNAnexus Platform. You can learn more by watching his talk or reading the summary below.
Typical pipeline development involves setting up an infrastructure, building a computation process, and analyzing the results. When adjustments are made, this process repeats as many times as necessary until the pipeline has been properly validated. With complex pipelines, this can consume many resources and a lot of time. Myriad wanted a more efficient way to iterate on its pipelines so that it could optimize them faster. Fast R&D, as Myriad defines it, is characterized by an environment in which you can make adjustments easily, find answers quickly, and don’t have to second-guess which areas of the pipeline need to change when making adjustments.
The team at Myriad first demonstrated this concept when they performed a retrospective analysis with a new background normalization step, the tenth step of a 15-step workflow, on over 100,000 NIPT (non-invasive prenatal test) samples. Simply rerunning the entire modified workflow would have taken two weeks. Instead, Myriad reduced this time to two hours by rethinking the pipeline and leveraging tools that enable computations to be reused.
Now codified, the approach in use at Myriad enables their team to make changes and iterate quickly, all with a focus on accuracy, reproducibility, and moving validated pipelines into production.
So how can you borrow from their approach and design your bioinformatics pipelines for faster iteration?
Make computational modules smaller
Although it’s tempting to use monorepositories when coding because they promote sharing, convenience, and low overhead, they don’t inherently promote modularity within pipelines. And modularity is what enables you to scale quickly, reuse steps, and identify and debug problems. Myriad organized all the source code for its workflows in monorepositories but developed smart ways to break the code within them into smaller modules and build only the modules that have been modified.
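Myriad’s build tooling isn’t public, but the “only build what changed” idea can be sketched in a few lines. Assuming one directory per module and a manifest of content hashes saved by the previous build (both hypothetical details), the rebuild list falls out of a hash comparison:

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("build_manifest.json")  # hashes recorded by the last build

def module_hash(module_dir: Path) -> str:
    """Hash every file in a module so that any edit changes the digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in module_dir.rglob("*") if p.is_file()):
        digest.update(str(path.relative_to(module_dir)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

def modules_to_rebuild(repo_root: Path) -> list:
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current = {d.name: module_hash(d) for d in sorted(repo_root.iterdir()) if d.is_dir()}
    MANIFEST.write_text(json.dumps(current, indent=2))
    return [name for name, h in current.items() if previous.get(name) != h]

# Only modules whose sources changed since the last build get rebuilt.
for module in modules_to_rebuild(Path("workflows")):
    print(f"rebuilding {module} ...")
```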
Take advantage of tools that enable you to reuse computations
If you run an app with the same input files and parameters, the results should be equivalent. So if you are changing a step downstream, why rerun all of the steps that come before it if they’ve already been run? The DNAnexus Platform, for example, includes a Smart Reuse feature, which enables organizations to optionally reuse the outputs of jobs that share the same executable and input IDs, even across projects. By reusing computational results, developers can dramatically speed the development of new workflows and reduce the resources spent on testing at scale. To learn more about Smart Reuse, visit our DNAnexus documentation here.
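The platform’s internals aren’t shown here, but the contract that makes this kind of reuse safe can be sketched in a few lines: key each job by its executable ID and input IDs, and return cached outputs whenever the same key appears again. The file-backed cache and the run_job callable below are hypothetical stand-ins, not the Smart Reuse implementation.

```python
import json
from pathlib import Path

CACHE = Path("job_cache.json")  # maps (executable ID, input IDs) -> outputs

def cache_key(executable_id: str, input_ids: dict) -> str:
    # Same executable + same inputs => same key, so prior results apply.
    return json.dumps({"exe": executable_id, "in": input_ids}, sort_keys=True)

def run_with_reuse(executable_id, input_ids, run_job):
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    key = cache_key(executable_id, input_ids)
    if key in cache:
        return cache[key]  # reuse prior outputs; skip the compute entirely
    outputs = run_job(executable_id, input_ids)  # the expensive step
    cache[key] = outputs
    CACHE.write_text(json.dumps(cache, indent=2))
    return outputs
```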
Use workflow tools to describe dependencies and manage the build process
Workflow tools, such as WDL (Workflow Description Language), make pipelines easier to express and build. With WDL, you can easily describe module dependencies and track version changes to the workflow. It’s also natural to integrate Docker with WDL: if you’re using an open-source container registry, you can simply edit one line of WDL to load a different version of a module with a new Docker image. Myriad writes its bioinformatics pipelines in WDL and statically compiles them with dxWDL into DNAnexus workflows, streamlining the build process. Learn more about running Docker containers within DNAnexus apps or dxWDL from our DNAnexus documentation.
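As a flavor of how small that change is, here is a toy WDL task embedded in a short Python helper that swaps the Docker image tag. The task, the image name, and the helper are illustrative assumptions, not Myriad’s actual code.

```python
import re

# A toy WDL task; in practice this would be read from a .wdl file on disk.
wdl_task = '''task align_reads {
  input { File reads }
  command { bwa mem ref.fa ~{reads} > aligned.sam }
  runtime { docker: "quay.io/example/bwa:0.7.17" }
}
'''

def set_docker_image(wdl_text: str, new_image: str) -> str:
    """Swap the image in the runtime block (the 'one line' edit)."""
    return re.sub(r'docker:\s*"[^"]+"', f'docker: "{new_image}"', wdl_text)

print(set_docker_image(wdl_task, "quay.io/example/bwa:0.7.18"))
```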
A structural variant is any segment of DNA greater than 50 base pairs that has been rearranged in some fashion, whether inserted, deleted, duplicated, inverted, or translocated. Structural variants contribute to many diseases, including cancer. Yet compared with single nucleotide variants, our understanding of structural variants isn’t as far along because they are difficult to identify, particularly with short-read sequencing.
Leading the Charge
Ben Busby, Scientific Lead at NCBI, and Fritz Sedlazeck, Assistant Professor at Baylor College of Medicine, led the hackathon as a way to encourage inter-institutional collaboration in tackling research questions related to structural variants of the genome.
DNAnexus provided cloud computing credits during the hackathon, and both DNAnexus and Nvidia sent scientists to support attendees with cloud computing, GPU-accelerated computing, and bioinformatics pipeline construction. Participants had the opportunity to learn how to build workflows using the DNAnexus Platform graphical user interface or from the command line using the Workflow Description Language (WDL). They could also build reproducible prototypes using Jupyter notebooks, a collaborative framework for working in the cloud. In addition, they learned how to use graphics processing units, or GPUs, to transform the efficiency of bioinformatics workflows. Incidentally, GPUs were originally designed to support high-quality gaming experiences, but their utility is now being harnessed for computationally intensive workloads such as physics simulation and deep learning.
The event also included an inspirational talk from Richard Gibbs, Director of the Baylor College of Medicine Human Genome Sequencing Center, on how the hacking mindset is actively transforming our understanding of genomics. From the mapping of the first human genome to the current era of precision medicine, many great scientific ideas have originated from hacking.
The 45 participants split into groups, and each group set to brainstorming ideas to pursue over the next three days. Ideas were pitched to the larger group and refined based on feedback.
The next three days were devoted to implementing each of the ideas, and this was when the room came alive. With the DNAnexus and Nvidia teams on hand to help groups get started, there was plenty of cross-talk and collaboration among the attendees, many of whom merged ideas and borrowed from one another’s prototypes. According to Claudia M.B. Carvalho Fonseca, PhD, one of the attendees, “It was fascinating to see the synergy between people from different disciplines — computational biology, bioinformatics, molecular biology, etc. — to work toward common goals.” She added: “The combination of good organization, time constraints, and diverse backgrounds boosted creativity and helped each group develop solutions.”
At the end of the hackathon, the teams presented their work to the broader community. Here are some highlights of the innovations that emerged.
Fast and efficient QC for multi-sample VCFs
This Python package can evaluate a VCF containing 2,500 samples in about one and a half minutes. Find the package here.
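The package itself is linked above; to give a sense of the kind of per-sample check such a QC tool performs, here is a minimal, dependency-free sketch that computes per-sample genotype missingness from a multi-sample VCF (the input path is hypothetical):

```python
import gzip

def per_sample_missingness(vcf_path):
    """Fraction of records with a missing genotype ('.'), per sample."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    samples, missing, n_sites = [], {}, 0
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample columns start at index 9
                missing = {s: 0 for s in samples}
                continue
            n_sites += 1
            for s, call in zip(samples, fields[9:]):
                gt = call.split(":", 1)[0]  # GT is the first FORMAT field
                if "." in gt.replace("|", "/").split("/"):
                    missing[s] += 1
    return {s: missing[s] / n_sites for s in samples} if n_sites else {}

# per_sample_missingness("cohort.vcf.gz")  # hypothetical input file
```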
Genome mis-assembly detection using structural variant calling
This quality-control tool for metagenomic assemblies uses dxWDL, a WDL compiler for the DNAnexus Platform, together with Docker to build workflows and port them to the DNAnexus Platform.
Note: The DNAnexus Platform currently doesn’t support GPU-enabled virtual machines for workflows run from the web UI, but this support is planned for a future release.
This pipeline identifies and validates de novo structural variants in genomics datasets from trios.
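The pipeline’s actual filtering and validation steps aren’t described in detail here, but the core trio logic can be sketched simply: flag calls present in the child and absent from both parents. A real pipeline layers genotype-quality and breakpoint checks on top of this.

```python
def is_candidate_de_novo(child_gt, mother_gt, father_gt):
    """True when the child carries a variant allele that neither parent does."""
    def carries(gt):  # gt like "0/1" or "0|1"; any non-ref allele counts
        return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))
    return carries(child_gt) and not carries(mother_gt) and not carries(father_gt)

print(is_candidate_de_novo("0/1", "0/0", "0/0"))  # True: child-only variant
```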
SWIft Genomes in a graph
This automated pipeline builds genome graphs quickly using a k-mer approach. Building graphs for whole genomes or large genomic regions is generally computationally expensive; with a multi-scale approach, however, this pipeline employs a simple algorithm and tool to build a genome graph of the human major histocompatibility complex (MHC) region within three minutes.
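The pipeline’s multi-scale method is more sophisticated than this, but the underlying k-mer idea can be sketched as a de Bruijn-style graph: nodes are (k-1)-mers, edges are observed k-mers, and two haplotypes that differ at a single position form a “bubble” in the graph.

```python
from collections import defaultdict

def build_kmer_graph(sequences, k=31):
    """De Bruijn-style graph: nodes are (k-1)-mers, edges are observed k-mers."""
    edges = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return edges

# Two toy haplotypes differing by one base produce a bubble in the graph.
graph = build_kmer_graph(["ACGTACGTAC", "ACGTACCTAC"], k=5)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```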
The spirit of innovation continued after the hackathon, when a group of attendees visited the Space Center in Houston. There, they saw the Saturn V rocket, the same model that carried the Apollo 11 mission on its journey to the moon.