Building a GenomeArk on the Cloud

By Arkarachai Fungtammasan

On August 28, 2019, the Vertebrate Genome Project (VGP) announced the completion of 100 high-quality assembled, phased, and scaffolded genomes [1]. These new reference genomes include a wide variety of mammals, birds, reptiles, amphibians, and fish. The VGP is a multi-institutional and multinational collaboration that aims to collect samples, develop methodologies, and craft high-quality genomes for more than ten thousand vertebrate species. In addition to supporting species conservation, these genomes are irreplaceable resources for understanding the biology of genomes and studying evolution at genome scale. The genome sequence serves as a foundation of modern genetic study, and any defects in an assembled genome could lead to incorrect biological conclusions. To ensure accurate data interpretation, the VGP established a new high standard of continuity, completeness, and accuracy for these genomes. This standard requires over half of all assembled base pairs to be included in contigs of greater than 1 million base pairs, and half of all assembled base pairs to be included in scaffolds of greater than 10 million base pairs. Additionally, 90% of base pairs must be assigned to a particular chromosome, the single-base error rate must be less than 0.01%, and the genome needs to be phased and annotated. Given that none of the genomes assembled prior to this effort met this standard, these are ambitious goals to meet on such a large scale.
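The continuity thresholds above correspond to what is commonly called the N50 statistic. As a rough illustration only, the Python sketch below computes N50 from a list of sequence lengths and checks the numeric criteria quoted in this post. The function names `n50` and `check_vgp_thresholds` and the example lengths are hypothetical, not part of any VGP or DNAnexus tool, and the phasing and annotation requirements are not modeled.

```python
from typing import Dict, Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the largest length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no sequences to measure


def check_vgp_thresholds(
    contig_lengths: Iterable[int],
    scaffold_lengths: Iterable[int],
    assigned_fraction: float,
    base_error_rate: float,
) -> Dict[str, bool]:
    """Hypothetical helper: compare summary metrics with the numeric
    thresholds quoted in the text (assumed here to map onto N50)."""
    return {
        "contig_n50_over_1Mb": n50(contig_lengths) > 1_000_000,
        "scaffold_n50_over_10Mb": n50(scaffold_lengths) > 10_000_000,
        "assigned_to_chromosomes_90pct": assigned_fraction >= 0.90,
        "error_rate_below_0.01pct": base_error_rate < 0.0001,
    }


if __name__ == "__main__":
    # Made-up lengths purely for illustration.
    contigs = [4_000_000, 2_000_000, 1_500_000, 500_000]
    scaffolds = [30_000_000, 12_000_000, 8_000_000]
    print(n50(contigs), n50(scaffolds))
    print(check_vgp_thresholds(contigs, scaffolds, 0.95, 0.00005))
```

In practice the lengths would be read from the assembly FASTA, and the chromosome-assignment and error-rate figures would come from separate evaluation steps. Here they are simply passed in as numbers.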

Constructing genomes at this unprecedented quality is very challenging. First, the computational requirement for genome assembly is extremely high. For example, a three-gigabase mammalian genome needs roughly 30,000 core hours of computing, with fluctuating memory and storage requirements, to produce the final assembly. Such high computational loads would tax many local cluster environments when assembling a single organism, let alone hundreds per year. Second, the bioinformatics pipeline for genome assembly and scaffolding is complicated and has to support multiple sequencing technologies. These pipelines are not trivial to set up, and keeping their versions consistent across multiple institutions is difficult. Third, existing genomic sequence data for these species is very limited, so it is difficult to assess the accuracy of an assembly.

A cloud platform like the one offered by DNAnexus can help manage these challenges. 

First, on the cloud, the computing resources can be deployed instantly and torn down after use. Additionally, different configurations of memory, storage, and computational cores can be deployed for each step, so that resource usage is optimized. Our scientists partnered with several lead scientists in this project including Adam Phillippy, Sergey Koren, and Arang Rhie from NIH, Olivier Fedrigo and Giulio Formenti from Rockefeller University, Richard Durbin, Shane McCarthy, and Kerstin Howe from Sanger Institute, Eugene Myers from Max-Planck-Gesellschaft, Erik Garrison from UCSC, and scientists from Pacific Biosciences to design these pipelines and optimize them to perform with high efficiency.

Second, the DNAnexus platform enables version control of the applications and workflows. This ensures that all participants are able to leverage the same tools configured in the same manner, and that the tools include all necessary dependencies. On a cloud platform like DNAnexus, the developer can encapsulate all the required programs and use the same version of those programs each time the application runs.

Third, the data sharing feature of the cloud allows researchers across geographical locations to collaborate and inspect these genomes. For example, one lab may collect samples, another lab may perform the sequencing, and a third lab could perform the assembly. The data generated during this process can be coordinated and shared in a single DNAnexus project. Once the assembly process is complete, the assembly curation team from the Sanger Institute plays a vital role by manually curating and correcting these genomes using various technologies, and by providing feedback on how to adjust the algorithms and pipelines to improve accuracy.

DNAnexus has been involved with the VGP since 2016. Our scientists, Brett Hannigan, Maria Simbirsky, and Arkarachai Fungtammasan, helped in constructing and maintaining the VGP assembly, phasing, scaffolding, and polishing pipelines on the cloud within the DNAnexus platform. The cloud is a suitable deployment system for highly fluctuating computing demand, version control, and collaboration, which address the challenges faced by VGP. Most of these assembled genomes generated by VGP have far exceeded the high standard set in this project.


The VGP group has an open-door policy. We encourage everyone who is interested to participate, whether through donations, by providing samples, or through scientific collaboration. For more detail, visit



ASHG 2019: Delivering on the Promise of Precision Medicine

We have our bags packed and ready for Houston! The annual American Society of Human Genetics (ASHG) meeting is one of our most anticipated events of the year and we can’t wait to join more than 6,500 researchers, academics, clinicians, genetic counselors, healthcare providers, and others in Houston, Texas.

This year is all about showcasing exciting advances in precision medicine. Visit DNAnexus at booth #418 for meet-and-greets, product demonstrations, and customer presentations on how translational informatics is accelerating the drug discovery process. We’ll be at our booth throughout the conference, but you can also email us to schedule a one-on-one meeting with a DNAnexus scientist.

Lunchtime Talk

12:45pm – 2:00pm
Thursday, October 17
Marriott Marquis Houston, Walker Street Hunter’s Creek Room

Come hear how City of Hope, a leading comprehensive cancer center, is operationalizing value to support the translation of clinical research discoveries to the bedside in multiple therapeutic areas. With the DNAnexus Apollo™ Platform, clinicians and researchers are able to query multi-modal datasets with an array of clinical and genomic attributes to identify and investigate specific cohorts of patients. With TGen (the Translational Genomics Research Institute) providing basic and clinical research capability support, City of Hope will be able to scale up its sequencing programs to accelerate discovery of highly qualified drug targets, identify biomarkers of disease progression and therapy response, and stratify patient populations for future clinical trials, all of which will inform clinical care and ultimately improve patient outcomes.
RSVP now to reserve your seat and meal.

Guest Speakers

  • Linda Bosserman, MD, Clinical Professor at City of Hope and Editor-in-Chief of ASCO’s Journal of Oncology Practice
  • Jonathan Keats, PhD, Director of Bioinformatics at TGen and Scientific Director of Briskin Center for Multiple Myeloma Research at City of Hope
  • David Fenstermacher, PhD, Vice President of Precision Medicine & Data Science, DNAnexus

Activities in DNAnexus Booth #418

Demo: Unlock the Power of the UK Biobank
Wednesday, October 16, 2-3 p.m. 
Friday, October 18,  10-11 a.m. 

The UK Biobank includes detailed health records, imaging, and other health-related data for approximately 500,000 consented participants. With a trove of information available from UK Biobank, researchers face the challenge of extracting insights and translating them into meaningful discoveries. DNAnexus Apollo for UK Biobank, a cloud-based analysis solution, handles the breadth of phenotypic data and the depth of whole exome sequencing (WES) data, making data exploration simpler. Join our demo hour to learn how to leverage the powerful cohort browser to explore, analyze, and visualize UK Biobank data to generate and test hypotheses for discovery at scale.

Demo: End-to-End Machine Learning Development on DNAnexus 
Thursday, October 17, 10:30-11:30 a.m.

Stop by our booth to hear Jason Chin, Director of Machine Learning at DNAnexus, discuss the latest AI/DL applications available on DNAnexus.

City of Hope Meet & Greet: Delivering on The Promise of Precision Medicine 
Thursday, October 17, 2-3 p.m.

Come hear how City of Hope is translating clinical research discoveries from bench to bedside with the support of TGen (the Translational Genomics Research Institute), and DNAnexus Apollo. Linda Bosserman, Assistant Clinical Professor at City of Hope, and Jonathan Keats, Director of Bioinformatics at TGen and Scientific Director of the Briskin Center for Multiple Myeloma at City of Hope, will be in the DNAnexus booth, ready to answer questions.

DNAnexus Posters

Wednesday, October 16
2-3 p.m.

Poster Number | Title | Affiliation
1707 | Comprehensive haplotype resolved MHC sequences from whole genome shotgun sequencing from single individual. |

Thursday, October 17
2-4 p.m.

Poster Number | Title | Affiliation
 | The portrait of fully phased assembled diploid human genome |
1663/T | A toolkit for accelerating genomic analysis using NGS index formats. |

Friday, October 18
1-3 p.m.

Poster Number | Title | Affiliation
 | A scalable framework for identifying genetic variant set associated with polygenic-traits in UK Biobank |

Customer/Partner Talk Featuring DNAnexus

Thursday, October 17
4-5 p.m.

Poster Number | Title | Affiliation
 | A robust and production-level approach to haplotype-resolved assembly of single individuals. | Harvard Medical School, HGSC, NIST, PacBio, Google Genomics,

DNAnexus Scoops European Innovations Award at SCOPE

We’ve always been proud of our innovative technology solutions, and recently we had the opportunity to celebrate our success with our European partners at the Summit for Clinical Ops Executives (SCOPE)-Europe in Barcelona, Spain. 

Preeti McGill, EMEA Account Director, was on hand to accept DNAnexus’ European Innovations Award, sponsored by Clinical Research News. Fellow awardees included the UK’s Guy’s and St. Thomas’ NHS Trust and Tata Consultancy Services.

The European Innovations Awards  program seeks to recognize outstanding examples of applied strategic innovation—partnerships, deployments, and collaborations that manifestly improve the clinical trial process. 

“These awards celebrate dedication and innovation in clinical research, and the winners chosen highlight the inspiring work being done in the industry,” said Allison Proffitt, Editorial Director of Clinical Research News. “The research community in Europe is increasingly open, and the projects showcased in this year’s award program prove their dedication to excellence.” 

One project in particular, the UK Biobank Consortium Data Delivery and Cohort Browser, was noted for its contribution to the scientific community. Part of the DNAnexus Apollo Platform, the cohort browser was designed to democratize data access to the UK Biobank, which has collected and developed a biospecimen and data resource on over 500,000 individuals. 

In collaboration with a consortium of pharma companies, the Regeneron Genetics Center has undertaken the exome sequencing and analysis of all 500,000 samples, using Apollo to host the dataset and run Regeneron’s software pipeline. We partnered with Regeneron to construct a combined, explorable database of the biobank’s genomic and phenotypic data, with an innovative “geno/pheno cohort browser” user interface that gives diverse teams the ability to browse and build cohorts among 3,000 phenotypic fields and 15,000,000 genomic variants across 100,000 samples.

“This resource has proven valuable to pharmaceutical companies, healthcare organizations and the scientific community, and has already led to more than 170 publications revealing novel associations with important epidemiological markers,” Clinical Research News noted.  

Data leads to new discoveries

The Sept. 18 award presentation came on the heels of another announcement: a £200 million expansion of the original £34 million pilot programme. The funding, half from government sources and half from biopharmaceutical and healthcare companies (Amgen, AstraZeneca, GlaxoSmithKline and Johnson & Johnson), will enable each of the biobank’s 500,000 samples to undergo whole genome sequencing at the Wellcome Sanger Institute in Cambridge, UK, and the deCODE site in Iceland.

“Today’s funding will support one of the world’s most ambitious gene sequencing programmes ever undertaken,” said UK Business Secretary Andrea Leadsom. “Its results could transform the field of genetic research – unlocking the causes of some of the most terrible diseases and how we can best tackle them. It will be a major step forward for individually tailored treatment plans, and will help us better understand why some people get certain diseases while others don’t.”

Data from Regeneron’s exome sequencing efforts have already led to several scientific discoveries. One of them is a better understanding of type 1 diabetes. This is the most common form of diabetes in children, which has led to it being called ‘juvenile diabetes’. However, a study using information from UK Biobank revealed that type 1 diabetes is almost as common in adults as it is in children. This had not been recognized previously because so many adults have a different kind of diabetes. Another study shed light on the effect of diabetes on the heart.

Scientists in Australia used the data to examine the prevalence of inbreeding in the UK (~13,200 people born to extreme inbreeding), and scientists at the Broad Institute used it to explore whether there are any genetic links involved in same-sex partnerships. Another team of researchers used both sequencing and imaging data from the UK Biobank to characterize the brain signature and genetic basis of left-handedness. 

“I am incredibly excited by the potential of genomics to change the way we think about disease and healthcare,” said UK Health and Social Care Secretary Matt Hancock. “In an ageing society with an increasing burden of chronic diseases, it is vital that we diagnose earlier, personalise treatment and where possible prevent diseases from occurring altogether. This project will help unlock new treatments and grow our understanding of how genetics affects our risk of disease.”

Researchers interested in applying for access to UK Biobank data should visit

For more information about the UK Biobank Cohort Browser, visit