How much next-gen sequencing data do the top genome centers in the world produce? It’s a staggering amount compared to even one year ago: The Broad Institute now has over 50 HiSeq 2000s, and BGI has over 100. Each HiSeq 2000 can sequence two human genomes per week, which means these centers could sequence in excess of 5,000 and 10,000 human genomes per year, respectively.
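The genomes-per-year figures follow directly from the per-instrument rate. A quick sanity check of the arithmetic (assuming 2 genomes per instrument per week and roughly 52 weeks of operation per year, as the numbers above imply):

```python
# Sanity check: genomes per year at each center, from the figures above.
# Assumptions: 2 human genomes per HiSeq 2000 per week, ~52 weeks/year.

GENOMES_PER_INSTRUMENT_PER_WEEK = 2
WEEKS_PER_YEAR = 52

for center, instruments in [("Broad Institute", 50), ("BGI", 100)]:
    per_year = instruments * GENOMES_PER_INSTRUMENT_PER_WEEK * WEEKS_PER_YEAR
    print(f"{center}: {per_year:,} genomes/year")
# Broad Institute: 5,200 genomes/year
# BGI: 10,400 genomes/year
```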
What would it take to transmit all the sequence data over the Internet? It turns out, surprisingly little. Let’s do some math: Each HiSeq 2000 can sequence 200 Gigabases per run, but takes over a week to do so. Illumina quotes the throughput of the instrument at 25 Gigabases per day, or about 1 Gigabase per hour. With quality scores and some simple compression, each base takes less than 1 byte of storage. In other words, a HiSeq 2000 produces about 1 Gigabyte of sequence data per hour, or 290 Kilobytes per second. To put this number in context, today people routinely stream movies over the Internet to their homes at a higher bitrate! Yes, these instruments produce a lot of data compared to the previous-generation technology, but it’s quite manageable over modern network connections.
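Here is the per-instrument calculation spelled out, using only the figures quoted above (25 Gigabases per day and at most 1 byte per base after compression):

```python
# Per-instrument data rate for a HiSeq 2000, from the figures in the text.
# Assumptions: 25 gigabases/day (Illumina's quoted throughput) and ~1 byte
# per base after quality scores and simple compression (an upper bound).

BASES_PER_DAY = 25e9
BYTES_PER_BASE = 1
SECONDS_PER_DAY = 86_400

bytes_per_second = BASES_PER_DAY * BYTES_PER_BASE / SECONDS_PER_DAY
print(f"{bytes_per_second / 1e3:.0f} KB/s per instrument")  # ~289 KB/s
```

That works out to roughly 290 Kilobytes per second, matching the number above.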
Let me go even further: A sequencing center operates sequencing instruments at perhaps 80% efficiency, so 290 Kilobytes/second * 80% * 8 bits/byte (for network transmission) = 2 Megabits per second. That means 50 HiSeq 2000 instruments, or the entire sequencing capacity of the Broad Institute, could fit over a 100 Megabit connection. A Gigabit connection could support four times the sequencing output of the BGI. A much smaller sequencing core, for example one with one or two HiSeq 2000s, can be supported easily with a 10 Megabit connection.
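The center-level math above can be sketched the same way, scaling the per-instrument rate by utilization and converting bytes to bits:

```python
# Center-level bandwidth, from the per-instrument figure above.
# Assumptions: 290 KB/s per instrument, 80% utilization, 8 bits/byte.

MBIT_PER_INSTRUMENT = 290 * 0.8 * 8 / 1000  # ~1.86 Mbit/s per instrument

for center, instruments in [("Broad (~50 HiSeq 2000s)", 50),
                            ("BGI (~100 HiSeq 2000s)", 100),
                            ("Small core (2 HiSeq 2000s)", 2)]:
    print(f"{center}: {instruments * MBIT_PER_INSTRUMENT:.0f} Mbit/s")
```

Fifty instruments come in under 100 Mbit/s, and even four times BGI’s hundred instruments fit comfortably within a Gigabit link.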
For those of you who operate a sequencing center, it may seem almost ludicrous that this is possible, and there are certainly reasons why these calculations are ideal-case: it’s difficult to get 100% of your connection’s rated bandwidth, your network is often congested by other ongoing activities, individual TCP streams are difficult to scale to Gigabit speeds, and so on. And moving further down the analysis pipeline to the SAM/BAM files from read mapping and variant calling, the data transfer demands can easily go up 10-fold. But these calculations are nonetheless close to what’s actually achievable today. Even if your actual network throughput is 50% of the ideal, most sequencing centers with a reasonable connection have no bandwidth issues streaming all their sequence data across the Internet.
Why would we want to do this? Once your data has been moved to an outside data center, it opens up tremendous opportunities: You can decide to store it there long-term and access it from anywhere in the world. You can give collaborators access to your data instantly. You can tap into vast compute resources available in the cloud, for example the hundreds of thousands of CPUs available on Amazon. You’re no longer bound by what your internal computing and networking infrastructure can support, and you can grow or shrink your infrastructure as needed. There are so many advantages to moving your sequence data outside your walls that I’ll leave that discussion for a future blog post.
Want to test out the bandwidth yourself? It’s easy to do: just sign up for a free account, and you’ll be able to upload three samples for free. If you want to upload directly from the sequencing instrument, we can also help you set that up in 10 minutes. Email firstname.lastname@example.org to find out more about streaming the data off your instrument to the cloud.