The Rising Tide of Genomic Data Points to the Cloud

No other market segment has felt the impact of the cloud more profoundly than the life sciences industry. In March, a major roadblock was removed when the National Institutes of Health lifted its ban on the use of government datasets (dbGaP, TCGA, etc.) in the cloud and updated its security best practices white paper. In the past, individual researchers would download data hosted at a variety of locations, attempt to integrate their own data, and run analyses on local hardware; a time-consuming endeavor fraught with headaches. This approach has become unsustainable as data sizes have grown exponentially while the cost of genome sequencing has been driven down. There is now a collective push within the genomics research community to create a cloud commons, something in which DNAnexus wholeheartedly believes.

Just how much data are we talking about? According to recent research, Earth contains around 5.3 × 10³⁷ DNA base pairs. The authors add: “By analogy, it would require 10²¹ computers with the mean storage capacity of the world’s four most powerful supercomputers (Tianhe-2, Titan, Sequoia, and K computer) to store this information.”
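For a sense of scale, those figures check out on a napkin. Here is a minimal Python sketch, assuming 2 bits per base pair and a mean supercomputer storage capacity of roughly 30 petabytes (an assumed round number for illustration; actual capacities vary):

```python
import math

# Back-of-envelope check of the published estimate (rough assumptions):
# ~5.3 x 10^37 base pairs on Earth, 2 bits per base pair, and an assumed
# mean storage capacity of ~30 PB across the four supercomputers named above.
BASE_PAIRS = 5.3e37
BITS_PER_BASE_PAIR = 2
MEAN_SUPERCOMPUTER_BYTES = 30e15  # ~30 petabytes, assumed

total_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR / 8
computers_needed = total_bytes / MEAN_SUPERCOMPUTER_BYTES

print(f"Total storage: ~10^{math.log10(total_bytes):.0f} bytes")
print(f"Computers needed: ~10^{math.log10(computers_needed):.0f}")
```

Even with generous rounding, the result lands on the order of 10²¹ machines, which is why no one is proposing to warehouse the planet’s biomass in a data center.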

Fortunately, no one has been asked to manage the DNA of our entire planet’s biomass, but the research points to a very real challenge. With this rising tide of data comes the need for more computational resources, and the question of whether to build or buy infrastructure comes into play. A recent article in The Platform takes a fascinating dive into how the genomics community is utilizing life sciences clouds. In the article, author Nicole Hemsoth (@NicoleHemsoth) argues that “what life sciences companies are missing is a management system for dealing with petabytes of data and billions of objects.”

She goes on to discuss how, as data management, storage, compute, security, and compliance features become hardened, bursting into the cloud is the most efficient way to utilize local and cloud resources together. While most large-scale genome centers have their own local infrastructure, their workloads tend to occur in spikes. To mitigate overprovisioning, genome centers are finding their sweet spot by combining in-house infrastructure with bursting into the cloud. And with the advent of genomics cloud solutions such as DNAnexus, there are ways to seamlessly move workloads to the cloud.

Another notable trend we’re seeing is life science companies like the Regeneron Genetics Center moving to a 100% cloud-based solution. Instead of the heavy investment that comes with managing and maintaining IT infrastructure, companies can invest in intellectual property, shifting their focus to R&D to accelerate medical discovery.

While freeing up the bandwidth spent building and maintaining local hardware is a big appeal of the cloud, the real reason institutions decide to go with DNAnexus is the genomics platform’s state-of-the-art compliance and security measures. It’s true that Amazon Web Services offers many built-in features to ensure security and privacy, and any skilled engineer can spin up Amazon EC2 instances, but when handling personal health information it’s just not that simple. DNAnexus has invested a tremendous amount of resources in creating a genomics platform that complies with ISO 27001 international security standards and provides data provenance, ensuring that all operations can be tracked and reported for up to six years.

Just as there isn’t one way to genotype, there isn’t one way to take your data to the cloud. The field is constantly evolving, which means you’re constantly running variant-calling bake-offs, working with many different tools to assess whether you are getting the correct variants of interest. The important question to ask is how you will manage all of these diverse data and research requirements. Do you want to do it yourself, or go with a proven genomics platform that already has a complete set of systems in place to control and manage your data? DNAnexus can help. Drop us a line when you’re ready.

Help Save the Black Rhino

Ntombi the Rhino

An alliance of institutions and individuals has formed to sequence the genome of Ntombi, a black rhinoceros. The members and financial supporters of this alliance hope that their efforts will contribute to the survival of this critically endangered species.

Hunted to Near Extinction
A poaching crisis driven by demand for rhino horn in East Asia, where it is prized as a status symbol and as an ingredient in traditional medicinal preparations, has reduced the number of living black rhinos to the point that the species’ survival is critically endangered. Three of eight black rhino subspecies have already been hunted to extinction, and just 5,055 animals remain in their native habitat ranges in Namibia and Coastal East Africa.

The Black Rhino Genome Project
The Black Rhino Genome Project intends to sequence the genome of Ntombi, a living member of the “south-central” black rhinoceros subspecies. Using the DNAnexus platform, the project will produce the first fully annotated version of the black rhino genome, publish the findings in a peer-reviewed journal, and make the genome information available as an open-access resource for analysis by the worldwide scientific community.

Black rhino research can then proceed rapidly, advancing scientific understanding of survival-related factors such as disease resistance and reproduction. By extending its scope to the genomic analysis of all eight black rhino subspecies, the Black Rhino Genome Project may also raise the possibility of bringing the three lost subspecies back from extinction.

An Alliance of Concerned Experts
The alliance at the forefront of the Black Rhino Genome Project is composed of University of Washington Professor of Pathology, Bioengineering, and Cardiology Chuck Murry, MD, PhD, whose research interests include tissue engineering; University of Washington Professor of Genome Sciences Jay Shendure, MD, PhD, who has developed a broad range of powerful genome analysis technologies; and Pembient, a Seattle, Washington-based biotech company dedicated to ending the threat to endangered species caused by poaching.

Pembient has already shown how technology can help the endangered rhinoceros win the fight for survival against poachers. Using genetic analysis and tissue engineering, Pembient has manufactured rhino horns that are indistinguishable from those cut from killed rhinoceroses and sold by poachers. By flooding the rhino horn black market with Pembient’s fabricated product, conservationists hope to take away the commercial incentive for rhinoceros poaching and reduce rhinoceros killings.

Crowdfunding Campaign
The Black Rhino Genome Project has organized a crowdfunding campaign to sequence the genome of the black rhinoceros Ntombi. As of July 8th, the project was 74% funded, with $12,121 pledged against a goal of $16,500. The campaign is hosted on a crowdfunding platform for scientific projects of merit and importance.

DNAnexus is providing secure, cloud-based DNA sequence data management, analysis, and distribution for the project through the DNAnexus platform, and is, in addition, a direct contributor to the Black Rhino Genome Project crowdfunding campaign. If you would like to join us, visit the Sequencing the Black Rhinoceros Genome campaign page for more information.


ENCODE Prepares For The Next Genome Data Explosion

A Challenging And Worthwhile Objective
Last week ENCODE hosted a three-day Research Applications and Users Meeting for the broader research community. The Encyclopedia of DNA Elements (ENCODE) Consortium is a global collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). Members of the ENCODE Consortium are addressing the challenge of creating a catalog of the functional elements of the human genome that will provide a foundation for studying the genomic basis of human biology and disease.

Now entering Phase 3, the ENCODE Consortium project is using next-generation technologies and methods to expand the size and scope of catalog content created in earlier phases. Phase 3 research is expected to produce 10 to 100 times more data than Phase 2, which received widespread media coverage upon its completion.

ENCODE’s Phase 3 analysis will require millions of compute core-hours and will generate nearly one petabyte of raw data over the next 18 months. It is hardly surprising, then, that in addition to keynote lectures and presentations by distinguished speakers, the ENCODE 2015 meeting agenda was packed with hands-on workshops and tutorials designed to give attendees the knowledge and skills they will need to access, navigate, and analyze ENCODE data. As announced last week, the ENCODE Consortium’s Data Coordination Center (DCC), located at Stanford University, has adopted the DNAnexus platform, enabling virtually unlimited access to data and bioinformatics tools in support of Phase 3 research.
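To put that petabyte in perspective, a quick Python sketch of the sustained data rate it implies (assuming 30-day months and decimal units, both simplifying assumptions):

```python
# One petabyte of raw data produced over 18 months, expressed as a
# sustained throughput. Assumes decimal units and 30-day months.
PETABYTE = 1e15  # bytes, decimal convention
MONTHS = 18
DAYS = MONTHS * 30           # assumed 30-day months
SECONDS = DAYS * 24 * 3600

bytes_per_day = PETABYTE / DAYS
bytes_per_second = PETABYTE / SECONDS

print(f"~{bytes_per_day / 1e12:.1f} TB of raw data per day")
print(f"~{bytes_per_second / 1e6:.0f} MB/s sustained, around the clock")
```

Roughly two terabytes of new raw data every day for a year and a half, which is exactly the kind of steady firehose that strains a fixed-capacity institutional cluster.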

Seamless Live Demo
Many large-scale genomics studies have been limited by the lack of required computational power and collaborative data management infrastructure. The power of the DNAnexus platform was demonstrated firsthand when roughly 150 ENCODE workshop attendees, using the DNAnexus platform, launched the RNA-Seq processing pipeline in unison on a sample dataset. While the ensuing demand for data retrieval and processing would bog down a typical institutional computing resource for hours or days, possibly “freezing out” any additional user requests during that time, the cloud-based DNAnexus platform simply scaled, as usual, to meet demand. Every workshop participant’s workflow was processed at full speed.


More Than Scale
Although a scalable solution capable of processing thousands of datasets was a key requirement for the DCC, it was not the only factor in its decision. Developing version-controlled ENCODE pipelines is a priority in the current phase of the ENCODE project, ensuring that data released to the public are consistently processed. Tasked with centralizing the project’s raw sequencing data under uniform metadata standards and bioinformatics analysis, the DCC will also use the DNAnexus platform to supply the Consortium with a secure, unified platform that already connects thousands of scientists around the world, and to provide the transparency, reproducibility, and data provenance needed for consistency among ENCODE pipelines and results. Stanford has open-sourced the ENCODE pipelines on GitHub, and they are also available in a public project on the DNAnexus platform, along with other public data and pipelines.

DNAnexus is proud to serve as the central analysis hub for the ENCODE project. We believe cloud-based solutions for genome projects will have a blockbuster impact on accelerating genomic medicine.