One Simple Solution for Ten Simple Rules

plos computational biologyLike many in the systems biology space, we have been longtime fans of Philip Bourne’s Ten Simple Rules articles since the first one was published in PLoS Computational Biology back in 2005. (“Ten Simple Rules for Getting Published,” October 2005.)

The latest installment is especially near and dear to us at DNAnexus: “Ten Simple Rules for Reproducible Computational Research,” written by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig. (And edited by Bourne, of course.) The writers begin with the premise that there is a growing need in the community for standards around reproducibility in research, noting that negative trends in paper retractions, clinical trial failures, and papers omitting necessary experimental details have been getting more attention lately.

“This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility,” Sandve et al. write. “In order to foster such aspects, it has been suggested that the scientific community needs to develop a ‘culture of reproducibility’ for computational science, and to require it for published claims.”

The rules begin with the lessons you learned when you got your first lab notebook — “Rule 1: For Every Result, Keep Track of How It Was Produced” — and progress to more complex mandates — “Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds.”

What really stood out for us was that all of these guidelines are addressed by best practices in cloud computing. For example, when we built our new platform, we implemented strict procedures to ensure auditability of data — the system automatically tracks what you did to get a result, ensures version control, serves as an archive of the exact analytical process you used, and stores the raw data underlying analyses. Utilizing a cloud-based pipeline also offers true reproducibility because you can always perform the same analysis again (using the specific version of your pipeline) or make your pipeline publicly accessible, granting anyone else the ability to rerun the analysis.

Be sure to check out all 10 rules, and feel free to take a tour of the DNAnexus platform to see how it can help you achieve reproducibility in your own computational research.

On DNA Day, We’re Thinking About (What Else?) Data

Today is DNA Day! This year it’s an especially big deal as we’re honoring the 60th anniversary of Watson and Crick’s famous discovery of the double-helix structure of DNA as well as the 10th anniversary of the completion of the Human Genome Project.

DNAnexusBack when Watson and Crick were poring over Rosalind Franklin’s DNA radiograph, they never could have imagined the data that would ultimately be generated by scientists reading the sequence of those DNA molecules. Indeed, even 40 years later at the start of the HGP, the data requirements for processing genome sequence would have been staggering to consider.

Check out this handy guide from the National Human Genome Research Institute presenting statistics from the earliest HGP days to today. In 1990, GenBank contained about 49 megabases of sequence; today, that has soared to some 150 terabases. The computational power needed to tackle this amount of genomic data didn’t even exist when the HGP got underway. Consider what kind of computer you were using in 1990: for us, that brings back fond memories of the Apple IIe, mainframes, and the earliest days of Internet (brought to us by Prodigy).

A couple of decades later, we have a far better appreciation for the elastic compute needs for genomic studies. Not only do scientists’ data needs spike and dip depending on where they are in a given experiment, but we all know that the amount of genome data being produced globally will continue to skyrocket. That’s why cloud computing has become such a popular option for sequence data analysis, storage, and management — it’s a simple way for researchers who don’t have massive in-house compute resources to go about their science without having to spend time thinking about IT.

So on DNA Day, we honor those pioneers who launched their unprecedented studies with a leap of faith: that the compute power they needed would somehow materialize in the nick of time. Fortunately, for all of us, that was a gamble that paid off!

Dispelling the Myths of the Cloud

cloud computing in genomicsWhat comes to mind when you hear the word “cloud”? Does the Amazon cloud immediately pop into your head?

Despite the cloud’s widespread recognition in the media, many are still uncertain about the benefits of cloud computing.  In a recent national survey, 95% of respondents who claimed they have never used the cloud actually have. In fact, they do so unwittingly nearly everyday via online banking and shopping, social networking, emailing etc.

Some scientists still seem skeptical of the cloud’s place in next-generation sequencing. If you tend to gravitate to skepticism, please read my article, Dispelling the Myths of the Cloud for the Skeptical Scientist, on

It provides an overview of how the cloud can be useful to scientists in a multitude of ways, such as infinite scalability, enabling instant and limitless access to storage and compute resources, eliminating up-front commitment to expensive hardware. Another advantage, Data security, is the cloud’s core competency, which cannot be measured up to any other internal infrastructure.

The “cloud” may be generating a lot of buzz in the NGS community, but is it worthy of all the hype? It appears that all signs point to yes.