2014: The Year of the Cloud


The Chinese New Year is almost upon us, and the Year of the Horse has us thinking about what 2014 will bring to the world of DNA sequencing. We believe that this will turn out to be the year of cloud computing. Here are a few of the trends that we’re watching:

Availability of large-scale genome studies. At one point, the 1000 Genomes Project was operating on a scale all its own. Today, many organizations are participating in large-scale sequencing projects to study thousands or even millions of people. As that data makes its way into the public realm, the demand for computational resources will soar. Accessing, querying, and manipulating these data sets will present a real challenge to IT teams, who will face bursty episodes of unusually high demand on top of the regular stream they normally see. That’s precisely the kind of environment where cloud computing makes the most sense: on-demand compute resources let IT teams meet peaks in demand without having to spend money on scaling up internal resources.

The new human reference genome. The Genome Reference Consortium has released build 38 of the human genome (known as GRCh38). This is a major improvement over the last build. Once the reference has been fully annotated, scientists around the world will want to dust off their existing human data sets and realign them to the updated reference to see if there are any new insights to be had. That will mean a short-term, high-intensity spike in demand for computational resources as these massive alignments are processed — in other words, the perfect occasion to try out cloud computing. It’s the cheapest possible way to add extensive computational resources without the long-term commitment to on-premises infrastructure.

Sequencing costs keep falling. The massive genomic studies underway have all been enabled by the rapidly falling cost of DNA sequencing, a trend that promises to continue thanks to Illumina’s recent announcement and efforts from startups still working to commercialize innovative new methods for sequencing. As sequencing a genome gets ever more affordable, demand for the resources to process and analyze that data will grow at a faster and faster pace. Trying to keep up with this demand will be an uphill battle for IT teams focused only on internal infrastructure, so we expect growing interest in how cloud computing can relieve the pressure on those teams to add boxes and storage components.

Growing number of analysis apps. The ecosystem of available tools for performing specific steps or types of DNA analysis is expanding rapidly. As scientists and bioinformaticians find a growing need to build pipelines utilizing a number of these tools, the ease of doing so in a cloud environment will make this option even more appealing.

Here at DNAnexus, we’re eager for what’s to come in 2014. We have a number of collaborations underway with academic and commercial R&D organizations, and we look forward to sharing details about them with our blog readers in the months ahead. Here’s to the Year of the Cloud and a great and productive year for the biomedical community!

One Simple Solution for Ten Simple Rules

Like many in the systems biology space, we have been fans of Philip Bourne’s Ten Simple Rules articles ever since the first one was published in PLoS Computational Biology back in 2005. (“Ten Simple Rules for Getting Published,” October 2005.)

The latest installment is especially near and dear to us at DNAnexus: “Ten Simple Rules for Reproducible Computational Research,” written by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig. (And edited by Bourne, of course.) The writers begin with the premise that there is a growing need in the community for standards around reproducibility in research, noting that negative trends in paper retractions, clinical trial failures, and papers omitting necessary experimental details have been getting more attention lately.

“This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility,” Sandve et al. write. “In order to foster such aspects, it has been suggested that the scientific community needs to develop a ‘culture of reproducibility’ for computational science, and to require it for published claims.”

The rules begin with the lessons you learned when you got your first lab notebook — “Rule 1: For Every Result, Keep Track of How It Was Produced” — and progress to more complex mandates — “Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds.”

What really stood out for us was that all of these guidelines are addressed by best practices in cloud computing. For example, when we built our new platform, we implemented strict procedures to ensure auditability of data — the system automatically tracks what you did to get a result, ensures version control, serves as an archive of the exact analytical process you used, and stores the raw data underlying analyses. Utilizing a cloud-based pipeline also offers true reproducibility because you can always perform the same analysis again (using the specific version of your pipeline) or make your pipeline publicly accessible, granting anyone else the ability to rerun the analysis.
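
As a rough illustration of Rules 1 and 6 (and not a depiction of how the DNAnexus platform works internally), here is a minimal Python sketch of what "tracking how a result was produced" can look like: each run stores a pipeline version, a checksum of the input data, and the random seed alongside the result, so that the same analysis can be repeated and checked later. The function names, the record fields, the toy analysis, the seed value, and the version string are all made up for this example.

```python
import hashlib
import json
import random


def digest_of(data):
    """Checksum of the input data, so a later rerun can confirm it is unchanged."""
    return hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()


def run_analysis(data, seed):
    """Toy analysis that includes randomness: the mean of a random subsample."""
    rng = random.Random(seed)  # local generator, so the seed alone determines the draw
    sample = rng.sample(data, k=3)
    return sum(sample) / len(sample)


def run_with_provenance(data, seed, pipeline_version):
    """Run the analysis and return the result with a record of how it was produced."""
    return {
        "pipeline_version": pipeline_version,  # hypothetical version label
        "input_sha256": digest_of(data),
        "random_seed": seed,
        "result": run_analysis(data, seed),
    }


def rerun(data, record):
    """Repeat the analysis from a stored record and report whether it matches."""
    if digest_of(data) != record["input_sha256"]:
        raise ValueError("input data differs from the recorded run")
    return run_analysis(data, record["random_seed"]) == record["result"]


data = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
record = run_with_provenance(data, seed=2013, pipeline_version="1.0.0")
print(json.dumps(record, indent=2))
print("reproduced:", rerun(data, record))
```

Because the seed and the input checksum travel with the result, anyone holding the record and the same data can repeat the run and confirm that the numbers agree, which is the spirit of the rules quoted above.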

Be sure to check out all 10 rules, and feel free to take a tour of the DNAnexus platform to see how it can help you achieve reproducibility in your own computational research.

On DNA Day, We’re Thinking About (What Else?) Data

Today is DNA Day! This year it’s an especially big deal as we’re honoring the 60th anniversary of Watson and Crick’s famous discovery of the double-helix structure of DNA as well as the 10th anniversary of the completion of the Human Genome Project.

Back when Watson and Crick were poring over Rosalind Franklin’s DNA radiograph, they never could have imagined the data that would ultimately be generated by scientists reading the sequence of those DNA molecules. Indeed, even at the start of the HGP nearly 40 years later, the data requirements for processing genome sequence data would have been staggering to consider.

Check out this handy guide from the National Human Genome Research Institute presenting statistics from the earliest HGP days to today. In 1990, GenBank contained about 49 megabases of sequence; today, that has soared to some 150 terabases. The computational power needed to tackle this amount of genomic data didn’t even exist when the HGP got underway. Consider what kind of computer you were using in 1990: for us, that brings back fond memories of the Apple IIe, mainframes, and the earliest days of the Internet (brought to us by Prodigy).

A couple of decades later, we have a far better appreciation for the elastic compute needs for genomic studies. Not only do scientists’ data needs spike and dip depending on where they are in a given experiment, but we all know that the amount of genome data being produced globally will continue to skyrocket. That’s why cloud computing has become such a popular option for sequence data analysis, storage, and management — it’s a simple way for researchers who don’t have massive in-house compute resources to go about their science without having to spend time thinking about IT.

So on DNA Day, we honor those pioneers who launched their unprecedented studies with a leap of faith: that the compute power they needed would somehow materialize in the nick of time. Fortunately, for all of us, that was a gamble that paid off!