Sequence Data: The View from JP Morgan

Last week, Andrew Lee, Vice President of Strategic Operations, and I attended the JP Morgan Healthcare Conference, an annual investor conference here in San Francisco that brings some 25,000 people to the city. The big news this year came from Life Technologies and Illumina, which both announced platforms that will be capable of sequencing an entire human genome in a day. Life Tech in particular noted that its same-day genome sequence will cost $1,000 in reagents — effectively putting an end to a race that began 10 years ago, when scientists first started seriously competing to achieve the $1,000 genome.

With this price point achieved, we expect people to sequence genomes at a much faster clip than ever before. Indeed, a survey from GenomeWeb and Mizuho Securities found that scientists anticipate that sequencing data will increase 32 percent this year over 2011, and increase another 38 percent next year compared to this year. That’s exciting on the data analysis front: As the volume of DNA data grows exponentially, it’s even more important to have a scalable platform to manage, store, and analyze that data securely and efficiently. The GenomeWeb/Mizuho survey also found that people expect to spend more on informatics in 2012 than they did in 2011.

After all, mainstream genome sequencing won’t be possible until the analysis costs come down by orders of magnitude. So much attention has gone to cutting the price tag for sequencing technologies, but that hasn’t translated into matching improvements in data costs. The race to the $1,000 genome may be over, but as it turns out, that was just the first leg of a relay. Now the baton has been passed to the data management and analysis folks, and it’s our turn to run as fast as we can.

Streaming an entire sequencing center across the Internet

How much next-gen sequencing data do the top genome centers in the world produce? It’s a staggering amount compared to even one year ago: The Broad Institute now has over 50 HiSeq 2000s, and BGI has over 100. Each HiSeq 2000 can sequence two human genomes per week, which means these centers could sequence in excess of 5,000 and 10,000 human genomes per year, respectively.
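The yearly figures above follow from simple multiplication. Here is a minimal sketch of that arithmetic, assuming 52 working weeks per year and using the lower-bound instrument counts quoted in the text; the function name is only illustrative.

```python
# Figures quoted in the post. Instrument counts are lower bounds
# ("over 50" and "over 100"), so real output could be higher.
GENOMES_PER_INSTRUMENT_PER_WEEK = 2
WEEKS_PER_YEAR = 52  # assumption: instruments run every week of the year


def genomes_per_year(instruments: int) -> int:
    """Yearly genome output if every instrument completes two genomes each week."""
    return instruments * GENOMES_PER_INSTRUMENT_PER_WEEK * WEEKS_PER_YEAR


if __name__ == "__main__":
    for name, count in [("Broad Institute", 50), ("BGI", 100)]:
        print(f"{name}: {genomes_per_year(count):,} genomes per year")
```

With 50 and 100 instruments this gives 5,200 and 10,400 genomes per year, which matches the "in excess of 5,000 and 10,000" estimate.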

What would it take to transmit all the sequence data over the Internet? It turns out, surprisingly little. Let’s do some math: Each HiSeq 2000 can sequence 200 Gigabases per run, but takes over a week to do so. Illumina quotes the throughput of the instrument at 25 Gigabases per day, or about 1 Gigabase per hour. With quality scores and some simple compression, each base takes less than 1 byte of storage. In other words, a HiSeq 2000 produces at most about 25 Gigabytes of sequence data per day, which is roughly 1 Gigabyte per hour, or 290 Kilobytes per second. To put this number in context, today people routinely stream movies over the Internet to their home at a higher bitrate! Yes, these instruments produce a lot of data compared to the previous generation technology, but it’s quite manageable over modern network connections.
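For readers who want to check the per-instrument rate, here is a small sketch of the conversion. It assumes decimal units (1 Gigabase = 10^9 bases, 1 Kilobyte = 1,000 bytes) and treats 1 byte per base as an upper bound, as the text does; the names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

GIGABASES_PER_DAY = 25   # Illumina's quoted HiSeq 2000 throughput, per the post
BYTES_PER_BASE = 1.0     # upper bound: base plus quality score, lightly compressed


def bytes_per_second(gigabases_per_day: float = GIGABASES_PER_DAY,
                     bytes_per_base: float = BYTES_PER_BASE) -> float:
    """Average sustained output of one instrument in bytes per second."""
    bases_per_second = gigabases_per_day * 1e9 / SECONDS_PER_DAY
    return bases_per_second * bytes_per_base


if __name__ == "__main__":
    rate = bytes_per_second()
    print(f"{rate / 1e3:.0f} Kilobytes per second")
    print(f"{rate * 3600 / 1e9:.2f} Gigabytes per hour")
```

The result is a little over 289 Kilobytes per second, which rounds to the 290 Kilobytes per second used in the rest of the calculation.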

Let me go even further: A sequencing center operates sequencing instruments at perhaps 80% efficiency, so each instrument needs 290 Kilobytes/second * 80% * 8 bits/byte (for network transmission) = about 1.9 Megabits per second, or roughly 2 Megabits per second. That means 50 HiSeq 2000 instruments, or the entire sequencing capacity of the Broad Institute, could fit over a 100 Megabit connection. A Gigabit connection could support four times the sequencing output of the BGI. A much smaller sequencing core, for example one with one or two HiSeq 2000s, can be supported easily with a 10 Megabit connection.
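The same estimate can be scaled to a whole center. The sketch below assumes the 290 Kilobytes per second and 80% utilization figures from the text, decimal units, and an ideal link with no protocol overhead; `required_mbps` is a hypothetical helper name.

```python
KB_PER_SEC_PER_INSTRUMENT = 290  # per-instrument rate derived in the post
UTILIZATION = 0.80               # "perhaps 80% efficiency"


def required_mbps(instruments: int,
                  kb_per_sec: float = KB_PER_SEC_PER_INSTRUMENT,
                  utilization: float = UTILIZATION) -> float:
    """Sustained bandwidth in Megabits per second needed to stream all instruments."""
    bits_per_second = instruments * kb_per_sec * 1000 * utilization * 8
    return bits_per_second / 1e6


if __name__ == "__main__":
    for label, count in [("small core", 2), ("Broad Institute", 50), ("BGI", 100)]:
        print(f"{label} ({count} instruments): {required_mbps(count):.1f} Mbit/s")
```

One instrument comes to about 1.86 Megabits per second, and 50 instruments to about 92.8, just under a 100 Megabit link. With exactly 100 instruments a Gigabit link would have a bit more than five times the needed capacity; the "four times" figure is consistent with BGI running somewhat more than 100 instruments.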

For those of you who operate a sequencing center, it may seem almost ludicrous that this is possible, and there are certainly reasons why these calculations represent an ideal case: it’s difficult to get 100% of your connection’s rated bandwidth, your network is often congested from other ongoing activities, individual TCP streams are difficult to scale to Gigabit speeds, etc. And moving further down the analysis pipeline to SAM/BAM files from read mapping and variant calling, the data transfer demands can easily go up 10-fold. But these calculations are nonetheless close to what’s actually achievable today. Even if your actual network throughput is 50% of the ideal, most sequencing centers with a reasonable connection have no bandwidth issues streaming all the sequence data across the Internet.
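To see how much headroom these caveats leave, here is a rough sketch that derates the link and optionally applies the 10-fold multiplier for downstream files. It assumes about 2 Megabits per second per instrument from the earlier estimate; the function name and parameters are illustrative, not a real capacity-planning tool.

```python
MBPS_PER_INSTRUMENT = 2.0  # rounded per-instrument estimate from the post


def fits_on_link(instruments: int,
                 link_mbps: float,
                 link_efficiency: float = 0.5,
                 data_multiplier: float = 1.0) -> bool:
    """True if the required rate fits within the usable share of the link.

    link_efficiency: fraction of rated bandwidth actually achieved.
    data_multiplier: 1.0 for raw reads, around 10 for SAM/BAM-sized output.
    """
    required = instruments * MBPS_PER_INSTRUMENT * data_multiplier
    usable = link_mbps * link_efficiency
    return required <= usable


if __name__ == "__main__":
    print(fits_on_link(2, link_mbps=10))                       # small core, raw reads
    print(fits_on_link(2, link_mbps=10, data_multiplier=10))   # same core, 10x data
    print(fits_on_link(50, link_mbps=100))                     # 50 instruments, 100 Mbit
    print(fits_on_link(50, link_mbps=1000))                    # 50 instruments, Gigabit
```

Under these assumptions a two-instrument core still fits on a 10 Megabit link at 50% efficiency for raw reads, but not once the data grows 10-fold. A 50-instrument center would need more than a 100 Megabit link at 50% efficiency, and fits comfortably on a Gigabit connection.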

Why would we want to do this? Once your data has been moved to an outside data center, it opens up tremendous opportunities: You can then decide to store it there long-term, and access it from anywhere in the world. You can give collaborators access to your data instantly. You can tap into vast compute resources available in the cloud, for example the hundreds of thousands of CPUs available on Amazon. You’re no longer bound by what your internal computing and networking infrastructure can support, and can grow or shrink your infrastructure as needed. There are so many advantages to moving your sequence data outside your walls that I’ll leave that discussion for a future blog posting.

Want to test out the bandwidth yourself? It’s easy to do: just sign up for a free account. You’ll be able to upload three samples for free. If you want to upload directly from the sequencing instrument, we can also help you set that up in 10 minutes. Email us to find out more about streaming the data off your instrument to the cloud.

Next-gen sequencing and the cloud – revolutionary, or hype?

It’s been an exciting time for DNAnexus since launching our company at the recent Bio-IT World Conference & Expo in Boston. We’ve spoken with many of you about your experience using DNAnexus and received great feedback, much of which is already finding its way into future releases.

One thing that struck us at Bio-IT World was the pervasiveness of “cloud”: talks filled with discussions about experimenting with cloud, vendor exhibits praising the magic of the cloud, an entire pre-conference workshop dedicated to cloud computing, and a keynote presentation describing the awesomeness of Amazon Web Services by Deepak Singh. It seems the NIH is also aware of this trend, as the NHGRI recently held a workshop in DC to bring together researchers and thought leaders to discuss the impact cloud will have on genome informatics. One year ago, “cloud” wasn’t on the tip of everyone’s tongue like it is today. So is the excitement over cloud mostly hype?

There is certainly skepticism out there, and plenty of negative experiences. Vivien Marx wrote a great story in BioInform (full disclosure: we were interviewed for the article) that highlights the ongoing debate over cloud computing and gives examples of real problems people in the field have experienced. The challenges of using cloud are of course not unique to computational biology, and have been discussed for years, for example in this excellent report from the UC Berkeley RAD Lab. The term “cloud” conjures up concerns about data transfer issues, security and control, platform lock-in, difficulty managing amorphous compute resources, the reliability of those resources, and over-crowding.

To address this skepticism, let’s first agree upon what we mean by “cloud”, because the term is used by some to describe anything that runs in your web browser, while to others it’s just a fashionable marketing tool for IT infrastructure. Our definition of cloud is an elastic and scalable infrastructure for compute, storage, and networking. Elastic means that you can grow or shrink your use of those resources at any time. Scalable means there’s always room to grow your infrastructure. These two traits of cloud computing are incredibly powerful: Do you have 100 jobs to run? Launch 100 compute nodes and run them all in parallel. Pay the same as running them in a serial fashion, but finish in 1/100th the time. Need to store 10 Terabytes of data for a 6-month project? No problem, it’s available, just pay for 60 TB-months of storage. And when the day comes that you need to run 10,000 compute nodes or store 10 Petabytes of data, you don’t have to worry about building out a datacenter: the cloud will scale to those levels!
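The pricing argument can be made concrete with a small sketch. It assumes pure pay-per-use billing by the node-hour with no minimum charge, independent jobs of equal length, and a made-up price; real cloud pricing has more moving parts, and all names here are hypothetical.

```python
import math


def compute_cost_and_time(jobs: int, hours_per_job: float, nodes: int,
                          price_per_node_hour: float) -> tuple[float, float]:
    """Return (total cost, wall-clock hours) for running equal-length jobs.

    Cost depends only on total node-hours consumed, so it is the same
    whether the jobs run serially on one node or in parallel on many.
    """
    node_hours = jobs * hours_per_job
    cost = node_hours * price_per_node_hour
    wall_clock_hours = math.ceil(jobs / nodes) * hours_per_job
    return cost, wall_clock_hours


def storage_tb_months(terabytes: float, months: float) -> float:
    """Billable storage, in TB-months, for a fixed volume held over time."""
    return terabytes * months


if __name__ == "__main__":
    price = 0.50  # hypothetical price per node-hour, for illustration only
    print(compute_cost_and_time(100, 1.0, nodes=1, price_per_node_hour=price))
    print(compute_cost_and_time(100, 1.0, nodes=100, price_per_node_hour=price))
    print(storage_tb_months(10, 6))
```

Under these assumptions, 100 one-hour jobs cost the same on 1 node as on 100 nodes, but the wall-clock time drops from 100 hours to 1, and 10 Terabytes held for 6 months comes to 60 TB-months.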

But as others have said, the cloud is not a utopia. It doesn’t magically support sequence analysis. It can be difficult to use, and your old applications generally won’t run in the cloud. But that’s because the cloud is not a solution in and of itself. It’s an infrastructure, or an engine that you can use to power your applications. And even if the cloud is like a super-charged V12 engine, it won’t take you anywhere by itself. To harness that energy you need to build a vehicle around the engine: the chassis, transmission, wheels, brakes, and steering wheel and console to present a user-friendly interface to the driver. Once you’ve built the car around the engine, suddenly it’s easy to use and hugely enabling.

DNAnexus’ use of the cloud mirrors this: we’ve built a web-based platform on top of the cloud to harness its power. All the sequence analysis and data management tools are available to you through your web browser, and we transparently manage all the cloud resources. Moving data around the cloud, figuring out where and how to store it reliably, launching compute nodes and coordinating their work – all this happens below the surface. We present an intuitive interface to you that removes all the challenges of using the cloud, while passing through all the benefits – tremendous scalability on-demand. Is it possible to build it without the cloud? Yes, but we wouldn’t be able to amortize the infrastructure costs over the thousands of people working with similar data. We wouldn’t be able to charge you a low per-sample cost.

So to answer the question: revolutionary or hype? It’s both. There’s a lot of hype, and as a result there’s understandably skepticism and disappointment. But once you go beyond that and look at the technology it enables, it’s truly revolutionary. DNAnexus’ goal is not to promote the hype. Our goal is to solve the next-gen sequencing data bottleneck. And we happen to use the cloud as a key component of our platform to solve it. As sequencing growth continues to outpace Moore’s law, you can be sure that your need for compute infrastructure will grow tremendously. We’re here to make that growth as painless and cost-effective as possible.

Take a look for yourself. Sign up for a free account today and tell us what you think. Is the cloud hype? Or is it an innovative approach to next-gen DNA sequence analysis and data management?