Refining GWAS Results Using Machine Learning

Genome-wide association studies (GWAS) give researchers a powerful way to identify genetic variants associated with a particular trait. GWAS have already identified numerous single nucleotide polymorphisms associated with diabetes, Parkinson’s disease, and other conditions. However, these comprehensive studies frequently identify large numbers of genetic variants associated with a phenotype, not all of which are causal.

Fine mapping, a statistical process in which additional data are layered onto the GWAS dataset, enables researchers to prioritize the variants that warrant further examination. It can also help them identify variants that narrowly missed the genome-wide significance threshold but are nonetheless likely to be causal.
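The webinar will cover the specific models in depth, but one standard piece of the Bayesian machinery (an assumption about scope here, not a description of the webinar’s exact method) is the Wakefield-style approximate Bayes factor in favor of association. For variant $j$ with GWAS effect estimate $\hat{\beta}_j$, standard error $s_j$, and a $N(0, W)$ prior on effect sizes,

$$
\mathrm{ABF}_j = \sqrt{\frac{s_j^2}{s_j^2 + W}}\,\exp\!\left(\frac{z_j^2}{2}\cdot\frac{W}{s_j^2 + W}\right), \qquad z_j = \hat{\beta}_j / s_j .
$$

Under the simplifying assumption of a single causal variant per locus, with equal prior odds for each variant, the posterior inclusion probability is just the normalized Bayes factor:

$$
\mathrm{PIP}_j = \frac{\mathrm{ABF}_j}{\sum_k \mathrm{ABF}_k}.
$$

Variants with high PIPs are the ones prioritized for follow-up, whether or not they cleared the original significance threshold.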

But fine mapping is easier said than done. For starters, you have to set up the proper computing environment, one that promotes traceability and reproducibility. Those qualities become even more important when you are testing a drug that may eventually enter clinical trials. You also need to assemble the data in the form your fine-mapping algorithms expect, which can be challenging. And then there are the scientific challenges: models are hard to compare and evaluate, and there are no frameworks that enable you to interact with the models and improve upon them.

The DNAnexus Platform provides end-to-end support for machine learning and enables you to build and deploy models so that domain scientists can ask questions and interact with the models themselves.

Join us for our upcoming webinar, in which we provide an overview of how to refine your GWAS results using fine mapping. Specifically, borrowing from Bayesian statistical methods, we present an interactive approach for applying machine learning-based models to fine mapping. We will demonstrate real-life examples using UK Biobank data on the DNAnexus Platform. Register now.

A Tale of Two Cities: Breaking Down Data Barriers to Deliver Precision Medicine

Delivering Precision Medicine

Advances in genome sequencing technologies and the growing availability of clinical and other health data present an opportunity to make targeted therapies a reality for patients – a sign that the best of times may be on the horizon for the field of precision medicine.

City of Hope, a leading cancer center, is paving the way with the ambitious goal of sequencing every cancer patient who walks through its doors, which will ultimately accelerate the discovery of targeted therapies and deliver personalized, high-quality care. At ASHG, we invited Linda Bosserman, MD, Assistant Clinical Professor at City of Hope, and Jonathan Keats, PhD, Director of Bioinformatics at the Translational Genomics Research Institute (TGen) and Scientific Director, Briskin Center for Multiple Myeloma at City of Hope, to discuss how they are leading the charge to deliver on this promise.

Dr. Bosserman and Dr. Keats both believe that the way to achieve these ambitious goals is to un-silo data stores and bring all data types into one view. With the fullest possible picture of the diagnostic details, a physician can discuss the most effective treatment options in shared decision-making with the patient.

Precision medicine in practice  

To achieve the goals set forth by City of Hope, the organization is leveraging DNAnexus’ Apollo Platform to integrate its clinically rich enterprise data warehouse with genomic, multi-omic, and imaging data. With harmonized datasets, researchers will be able to seamlessly conduct genome-wide association studies and other analyses examining the relationships among genotype, phenotype, and clinical outcomes.

To diagnose and then present a patient with therapy options, clinicians need access to a myriad of data types. Dr. Bosserman spoke about the need for accurate and complete medical records covering patient demographics, medications, immunizations, family history, and cancer data by episode, including stage, tumor features, and mutations at diagnosis and progression. Only when physicians understand the full scope of the patient journey can they weigh the costs and benefits of candidate treatments and suggest the best therapies or clinical trials to the patient. Breaking down data silos and making this detailed information available to the clinician, especially as we gain the capacity to add validated decision-support information, is crucial to delivering the most precise and personalized level of cancer care.

Precision medicine at the bedside – wherever that may be

City of Hope is an independent, cutting-edge cancer research and treatment center located near Los Angeles, CA, and has expanded to provide its state-of-the-art targeted therapies to patients at 31 clinical network locations throughout Southern California. City of Hope also works with employers across the country, as well as international physicians and patients, to provide access to its internationally renowned cancer expertise through written consultations, coordination of care with primary clinicians, telemedicine, or in-person visits to the campus.

Dr. Keats joined us from TGen, a City of Hope affiliate in Phoenix, Arizona, where they are building a genomics-enabled precision medicine model. TGen is breaking away from the traditional evidence-based approach, which uses population-level percentages to determine the best treatment option for a generalized patient with a certain phenotype. Instead, the team works to determine the most precise treatment path based on what will work for that specific patient’s tumor type, using their comprehensive genomic analysis model and translating it to clinical care.

As a first step, Dr. Keats and his research team at TGen conduct tumor-normal whole exome sequencing and tumor RNA sequencing. Analysis is conducted with an internally configured and validated pipeline that mixes open-source, commercial, and internal tools to identify single-nucleotide variants, copy-number and structural variations, gene fusions, and transcript variations. The team measures DNA and RNA changes across more than 20,000 genes and applies gene-drug matching to identify potential therapeutic options. Treatment is then recommended by a tumor board based on the genomic characteristics of each patient’s tumor.
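As a purely illustrative sketch of how such a pipeline might be laid out (the file names, task names, and structure below are hypothetical, not TGen’s actual pipeline), a WDL workflow could wire the tumor-normal, RNA, and gene-drug matching stages together like this:

```wdl
version 1.0

# Hypothetical skeleton only; the real pipeline mixes open-source,
# commercial, and internal tools.
import "../tasks/somatic.wdl" as somatic
import "../tasks/rna.wdl" as rna
import "../tasks/report.wdl" as report

workflow tumor_profiling {
  input {
    File tumor_exome_bam
    File normal_exome_bam
    File tumor_rna_bam
    File gene_drug_table   # curated gene-to-therapy matches
  }

  # Tumor-normal comparison: SNVs, copy-number and structural variants.
  call somatic.call_variants {
    input: tumor = tumor_exome_bam, normal = normal_exome_bam
  }

  # RNA: gene fusions and transcript variants.
  call rna.detect_fusions { input: rna = tumor_rna_bam }

  # Gene-drug matching over the combined variant set.
  call report.match_gene_drug {
    input: variants = call_variants.vcf,
           fusions = detect_fusions.calls,
           table = gene_drug_table
  }

  output {
    File therapy_options = match_gene_drug.report
  }
}
```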

Precision Medicine Workflow

Dr. Bosserman and Dr. Keats shared similar methodologies for leveraging rich and varied data sources to gain insights, yet medical centers and clinical research labs continue to face hurdles in translating discoveries into clinical practice. To reach actionable insights for precision clinical care, researchers must be equipped with the best possible technologies for correlating real-world clinical and phenotype data with genomic and multi-omic data to better understand the molecular mechanisms of disease.

At DNAnexus, we are committed to solving this problem by bringing together people from different disciplines who approach the issue from different angles, and by supporting initiatives like these that seek to access and explore heterogeneous datasets, make meaningful discoveries faster, and deliver better, more targeted patient care.

See how DNAnexus Apollo can accelerate your discovery with a molecular precision medicine hub for advanced data harmonization and exploration. 

Designing Bioinformatics Pipelines for Fast Iteration

When genetic tests are ordered, little thought is usually given to all of the bioinformatics work required to make the test possible. The bioinformatics team at Myriad Genetics, however, understands firsthand just how much work it takes. Myriad Genetics provides diagnostic tests that help physicians understand risk profiles, diagnose medical conditions, and inform treatment decisions. To support their comprehensive test menu and commitment to timely, accurate test results, the bioinformatics team at Myriad focuses on optimizing their bioinformatics pipelines. How? By designing pipelines that leverage modularity and computational reuse, so they can make improvements and iterate more quickly.

Jeffrey Tratner, Director of Software Engineering, Bioinformatics at Myriad, spoke at DNAnexus Connect, explaining how fast iteration works on the DNAnexus Platform. You can learn more by watching his talk or reading the summary below.


Typical pipeline development involves setting up infrastructure, building a computation process, and analyzing the results. When adjustments are made, this process repeats as many times as necessary until the pipeline has been properly validated. With complex pipelines, that loop can consume substantial resources and time. Myriad wanted a more efficient way to iterate on their pipelines so they could optimize them faster. Fast R&D, as Myriad defines it, is an environment in which you can make adjustments easily, find answers quickly, and don’t have to second-guess which areas of the pipeline need to change when making adjustments.

Myriad reduced pipeline R&D from two weeks to two hours by leveraging tools that enable them to reuse computations.

The team at Myriad first demonstrated this concept when they performed a retrospective analysis with a new background normalization step, the tenth step of a 15-step workflow, on over 100,000 NIPT (non-invasive prenatal test) samples. Simply rerunning the entire modified workflow would have taken two weeks. Instead, Myriad reduced this time to two hours by rethinking the pipeline and leveraging tools that enable them to reuse computations.

Now codified, the approach in use at Myriad enables their team to make changes and iterate quickly, all with a focus on accuracy, reproducibility, and moving validated pipelines into production. 

So how can you borrow from their approach and design your bioinformatics pipelines for faster iteration?

Make computational modules smaller

Although it’s tempting to use monorepositories when coding because they promote sharing, convenience, and low overhead, they don’t promote modularity within pipelines. And modularity is what enables you to scale quickly, reuse steps, and identify and debug problems. Myriad organized all the source code for their workflows in monorepositories, but developed smart ways to break that code into smaller modules and build only the modules that have been modified, as sketched below.
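For illustration, here is a minimal sketch of what a modular layout can look like in WDL. The file names, task names, and repo structure are hypothetical, not Myriad’s actual code; the point is that each step lives in its own module, so each module can be versioned and rebuilt independently:

```wdl
version 1.0

# Hypothetical monorepo layout:
#   workflows/nipt.wdl   <- this file
#   tasks/align.wdl      <- one module per pipeline step
#   tasks/normalize.wdl
import "../tasks/align.wdl" as align
import "../tasks/normalize.wdl" as normalize

workflow nipt_pipeline {
  input {
    File reads
    File reference
  }

  call align.bwa_align {
    input: reads = reads, reference = reference
  }

  # The downstream module consumes the upstream module's output.
  call normalize.background_normalize {
    input: counts = bwa_align.counts
  }

  output {
    File normalized = background_normalize.normalized_counts
  }
}
```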

Take advantage of tools that enable you to reuse computations

If you run an app with the same data, the same input files, and the same parameters, your results should be equivalent. So if you are changing a step downstream, why rerun all of the steps that come before it if they’ve already been run? The DNAnexus Platform, for example, includes a Smart Reuse feature, which lets organizations optionally reuse the outputs of jobs that share the same executable and input IDs, even when those outputs live in other projects. By reusing computational results, developers can dramatically speed the development of new workflows and reduce the resources spent on testing at scale. A sketch of the idea follows; to learn more about Smart Reuse, visit our DNAnexus documentation here.
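Continuing the hypothetical NIPT example from the previous section: if version 2 of the workflow changes only the normalization module, the edit is a single import line. Every call upstream of that step still pairs the same executable with the same input IDs, so a feature like Smart Reuse can return cached outputs for those steps and recompute only what changed:

```wdl
version 1.0

# Version 2 of the hypothetical workflow above. The only edit is the
# normalization module swap; everything else is identical to version 1.
import "../tasks/align.wdl" as align
import "../tasks/normalize_v2.wdl" as normalize   # <- the one changed line

workflow nipt_pipeline {
  input {
    File reads
    File reference
  }

  # Same executable + same input IDs as in version 1:
  # Smart Reuse can return this call's cached outputs.
  call align.bwa_align {
    input: reads = reads, reference = reference
  }

  # New executable: this step, and anything downstream, actually runs.
  call normalize.background_normalize {
    input: counts = bwa_align.counts
  }

  output {
    File normalized = background_normalize.normalized_counts
  }
}
```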

Smart Reuse Bioinformatics Pipeline

Use workflow tools to describe dependencies and manage the build process

Workflow tools, such as WDL (Workflow Description Language), make pipelines easier to express and build. With WDL, you can easily describe module dependencies and track version changes to the workflow. Docker also integrates naturally with WDL: if you’re using an open-source container hub, you can edit a single line of WDL to load a different version of a module via a new Docker image, as in the sketch below. Myriad writes their bioinformatics pipelines in WDL and statically compiles them with dxWDL into DNAnexus workflows, streamlining the build process. Learn more about running Docker containers within DNAnexus apps or dxWDL from our DNAnexus documentation.
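For example, a task module can pin its tool version through the docker line in its runtime block (the image name and tool invocation here are hypothetical). Bumping the image tag is the entire change:

```wdl
version 1.0

task background_normalize {
  input {
    File counts
  }

  command <<<
    # Hypothetical tool shipped inside the container image.
    normalize_counts --input ~{counts} --output normalized.tsv
  >>>

  output {
    File normalized_counts = "normalized.tsv"
  }

  runtime {
    # The one line to edit when a new module version ships,
    # e.g. bump 2.1.0 -> 2.2.0 to pull the new image.
    docker: "myorg/normalize:2.1.0"
  }
}
```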