2014: The Year of the Cloud

 

The Chinese New Year is almost upon us, and the Year of the Horse has us thinking about what 2014 will bring to the world of DNA sequencing. We believe that this will turn out to be the year of cloud computing. Here are a few of the trends that we’re watching:

Availability of large-scale genome studies. At one point, the 1000 Genomes Project was operating on a scale all its own. Today, many organizations are participating in large-scale sequencing projects to study thousands or even millions of people. As that data makes its way into the public realm, the demand for computational resources will soar. Accessing, querying, and manipulating these data sets will present a real challenge to IT teams, who will see bursty episodes of unusually high demand on top of their regular workload. That’s precisely the kind of environment where cloud computing makes the most sense: on-demand compute resources allow IT teams to meet spikes in infrastructure needs without spending money on scaling up internal resources.

The new human reference genome. The Genome Reference Consortium has released build 38 of the human genome (known as GRCh38). This is a major improvement over the last build. Once the reference has been fully annotated, scientists around the world will want to dust off their existing human data sets and realign them to the updated reference to see if there are any new insights to be had. That will mean a short-term, high-intensity spike in demand for computational resources as these massive alignments are processed — in other words, the perfect occasion to try out cloud computing. It’s the cheapest possible way to add extensive computational resources without the long-term commitment to on-premises infrastructure.

Sequencing costs keep falling. The massive genomic studies underway have all been enabled by the rapidly falling cost of DNA sequencing, a trend that promises to continue thanks to Illumina’s recent announcement and efforts from startups still working to commercialize innovative new methods for sequencing. As sequencing a genome gets ever more affordable, demand for the resources to process and analyze that data will grow at a faster and faster pace. Trying to keep up with this demand will be an uphill battle for IT teams focused only on internal infrastructure, so we see this leading to interest in how cloud computing can help relieve the pressure on those teams to add boxes and storage components.

Growing number of analysis apps. The ecosystem of available tools for performing specific steps or types of DNA analysis is expanding rapidly. As scientists and bioinformaticians find a growing need to build pipelines utilizing a number of these tools, the ease of doing so in a cloud environment will make this option even more appealing.

Here at DNAnexus, we’re eager for what’s to come in 2014. We have a number of collaborations underway with academic and commercial R&D organizations, and we look forward to sharing details about them with our blog readers in the months ahead. Here’s to the Year of the Cloud and a great and productive year for the biomedical community!

At Bio-IT World, Genome Centers Dished on Big Data

At the Bio-IT World Conference & Expo last week in Boston, more than 2,500 attendees descended on the World Trade Center to hear about the latest in hardware, analysis, data storage, and much more. The DNAnexus team was out in force, and we were delighted to share updates about our new platform with the many attendees who stopped by our booth.

The conference had a number of excellent keynote talks this year, including those from Atul Butte of Stanford and Andrew Hopkins of the University of Dundee. We also really enjoyed seeing Steven Salzberg accept the Benjamin Franklin Award for Open Access in the Life Sciences, a much-deserved honor for one of the veterans of the bioinformatics field.

Perhaps most interesting was a panel discussion about big data featuring members of major genome centers. Panelists included Guy Coates from Sanger, Xing Xu from BGI, Eric Jones from the Broad Institute, and Alexander (Sasha) Zaranek from Harvard Medical School and a company called Clinical Future.

For those of us who remember when it was a big deal to have a terabyte of storage available, it was truly amazing to hear that most of the panelists have 15 petabytes or more of data stored and easily accessible. Still, even with resources like that, some of the panelists encourage their institute members to delete data when possible, such as the unaligned reads from a sequencing run.

Access control is a real problem for managing data at these large centers. Sanger’s Coates said that his institute’s move into the clinical field, complete with consent forms and all the other compliance needs, makes controlling access “a real nightmare” for his team. Jones at the Broad said that this issue basically means people in the field are living on borrowed time as it becomes increasingly important to find the right solution to this challenge. Zaranek noted that Clinical Future will use the Arvados tool to include security permissions and provenance along with the files to address this issue.

The panelists also specifically discussed cloud computing, with BGI’s Xu saying that the cloud is his center’s main data repository. One remaining goal is to facilitate more rapid and efficient exchange of genomic data globally via higher bandwidth. His team has tested this using Aspera and successfully transferred 24 GB in just 30 seconds across countries, but this feat is not yet economical enough for routine use. Coates said that his group uses cloud options (including Amazon) for research projects, but they are still evaluating how to integrate cloud for the production pipeline in a cost-effective way. At the Broad, Jones said, the need to move to the cloud is understood, but so far internal computing is still enough for institute members; he added that the cloud’s elasticity will ultimately drive adoption, allowing people to run very large jobs that would otherwise interfere with the rest of the institute’s compute resources. Zaranek said that his group is using cloud computing from Harvard and from Amazon, and that having both options is incredibly valuable. This arrangement will also allow other organizations to access the group’s resources. Coates and Jones said that the real challenge in managing data comes when individual researchers start moving data around, because tracking that data and predicting resource needs can become difficult.
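To put the Aspera test in perspective, the figures quoted on the panel can be turned into an average transfer rate. The short Python sketch below does that arithmetic. It assumes decimal gigabytes (1 GB = 10^9 bytes) and a constant rate over the full 30 seconds, neither of which was specified on the panel, and the function names are ours, purely for illustration.

```python
def throughput_gbit_per_s(gigabytes, seconds):
    """Average transfer rate in gigabits per second (decimal units assumed)."""
    bits = gigabytes * 1e9 * 8
    return bits / seconds / 1e9


def transfer_time_s(gigabytes, gbit_per_s):
    """Seconds needed to move `gigabytes` at a sustained `gbit_per_s`."""
    return gigabytes * 8 / gbit_per_s


# The figure reported on the panel: 24 GB in 30 seconds.
rate = throughput_gbit_per_s(24, 30)
print(f"{rate:.1f} Gbit/s")  # 6.4 Gbit/s

# Extrapolation under the same assumptions: one terabyte (1000 GB).
minutes_per_tb = transfer_time_s(1000, rate) / 60
print(f"{minutes_per_tb:.1f} minutes per terabyte")  # 20.8 minutes per terabyte
```

Under those assumptions, a terabyte would take roughly 21 minutes at the same rate, and a petabyte about two weeks.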

These are all issues that we have given a great deal of thought to as we designed and built the new DNAnexus, now available for beta testing. We agree that security and compliance are important components of any compute solution — whether cloud-based or in-house — and that’s why we baked the highest standards right into our new tool. Having flexibility to configure the environment as needed, such as scaling up or down at a moment’s notice, is another key trait of the new platform and one that we believe will be quite useful for scientists in individual labs or at these major genome centers streaming data around the clock.

Meet the new DNAnexus and its Configurable Cloud Infrastructure

It’s been a busy first week since we launched the beta of the new DNAnexus, our cloud-based DNA analysis platform designed for bioinformaticians. We’ve been blown away by the number of people who have signed up for the program and provided a lot of very positive and constructive feedback. We encourage all of our beta users to continue to comment on their experience. Request access today and see for yourself what it is all about.

 

This week we’d like to highlight one of the core capabilities of the new DNAnexus platform, the configurable cloud infrastructure, which lets you take full advantage of Amazon’s scalable and cost-effective Web Services. It not only allows you to scale your computational and data storage needs to any level, it is also fully scriptable and allows you to create an analysis solution that fits your specific needs. The benefit is eliminating capacity planning since you can now store and process any data on demand and only pay for what you use.

 

At DNAnexus we have always used the pay-as-you-go model for computational and storage services; this will continue with the new DNAnexus. The benefit of a pay-as-you-go approach is that you can cost-effectively address your needs today and scale up or down as those needs change. Whether you are familiar with or new to sequence data analysis, you can immediately get started with your data analysis projects without any setup costs or capacity planning risks — regardless of how many samples you might have. This is because the new platform, with its configurable infrastructure, processes samples in parallel, resolving resource contention issues among different teams.
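To see why per-sample parallelism matters, here is a toy Python sketch that compares processing a batch of samples one after another with processing them all at once. It is not how the platform is implemented: `analyze_sample` is a hypothetical stand-in that simply sleeps, and the thread pool merely mimics having one worker available per sample.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze_sample(sample_id, seconds=0.05):
    """Hypothetical stand-in for one sample's analysis job."""
    time.sleep(seconds)  # pretend to do the work
    return f"{sample_id}: done"


def run_serial(sample_ids):
    """Fixed capacity: samples wait their turn on a single worker."""
    return [analyze_sample(s) for s in sample_ids]


def run_parallel(sample_ids):
    """On-demand capacity: one worker per sample, all running at once."""
    with ThreadPoolExecutor(max_workers=len(sample_ids)) as pool:
        return list(pool.map(analyze_sample, sample_ids))


def timed(fn, sample_ids):
    """Return the function's results and its wall-clock time in seconds."""
    start = time.perf_counter()
    results = fn(sample_ids)
    return results, time.perf_counter() - start


samples = [f"sample-{i}" for i in range(8)]
_, serial_s = timed(run_serial, samples)
_, parallel_s = timed(run_parallel, samples)
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

In the serial case the total time grows with the number of samples. In the parallel case it stays close to the time for a single sample, which is the behavior the pay-as-you-go model relies on.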

 

When we set out to build the new platform, one of the most common requests we heard was for a fully configurable solution, one that gives bioinformaticians and computational analysts the ability to run custom programs, tune compute performance through parallelization, and more. All of this is now possible with the new platform through its well-documented APIs and SDK, which allow rich scripting for any data management, analysis, visualization, or reporting need.

 


Another advantage of this new infrastructure is that you can now manage and manipulate your data not only via the web interface, but also through the command line, which is compatible with Linux and Mac OS X. The open and flexible new DNAnexus platform, with its SDK language support, allows you to run any tool in any language and perform platform operations through API bindings in Python, C++, Java, and the Bash shell. This allows you to fully automate entire workflows from sequencing data upload to analysis and report generation. You may also create best-practice workflows that can be easily shared with non-bioinformaticians within or across institutions.
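As a rough sketch of what such automation could look like, the Python snippet below assembles the command lines for an upload-then-analyze step for each sample, without executing anything. Treat the details as assumptions to check against the command-line client’s own help output: it presumes a `dx` client with `upload` and `run` subcommands and the `--brief`, `-y`, and `-i<name>=<value>` options. The applet name `my_alignment_applet`, the input field `reads`, and the file names are hypothetical placeholders.

```python
import shlex


def upload_command(local_path):
    # Assumed: `dx upload <path> --brief` prints only the new file's ID.
    return ["dx", "upload", local_path, "--brief"]


def run_command(executable, inputs):
    # Assumed: inputs are passed as -i<name>=<value>; -y skips the
    # confirmation prompt and --brief prints only the job ID.
    cmd = ["dx", "run", executable]
    cmd += [f"-i{name}={value}" for name, value in inputs.items()]
    return cmd + ["-y", "--brief"]


def plan(local_paths, executable, input_name):
    """Build (but do not run) the commands for each sample, in order."""
    steps = []
    for path in local_paths:
        steps.append(upload_command(path))
        # Placeholder: a real script would substitute the file ID
        # printed by the upload step above.
        steps.append(run_command(executable, {input_name: f"<file-id of {path}>"}))
    return steps


# Hypothetical file names, applet name, and input field.
for step in plan(["sample1.fastq.gz", "sample2.fastq.gz"], "my_alignment_applet", "reads"):
    print(shlex.join(step))
```

Building the commands as lists and printing them first gives a dry run that can be reviewed before anything is uploaded or billed. A real script would hand each list to `subprocess.run` and capture the printed IDs.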

 

In the weeks to come, we’ll explore the many additional capabilities of the new DNAnexus (e.g., the “Extensible Genomics Toolbox”, “Instant Collaboration”, and “Security and Compliance”). In the meantime, please take advantage of our beta program and sign up for your own account and explore firsthand what the new DNAnexus has to offer.