At Bio-IT World, Genome Centers Dished on Big Data

At the Bio-IT World Conference & Expo last week in Boston, more than 2,500 attendees descended on the World Trade Center to hear about the latest in hardware, analysis, data storage, and much more. The DNAnexus team was out in force, and we were delighted to share updates about our new platform with the many attendees who stopped by our booth.

The conference had a number of excellent keynote talks this year, including ones from Atul Butte of Stanford and Andrew Hopkins of the University of Dundee. We also really enjoyed seeing Steven Salzberg accept the Benjamin Franklin Award for Open Access in the Life Sciences, a much-deserved honor for one of the veterans of the bioinformatics field.

Perhaps most interesting was a panel discussion about big data featuring members of major genome centers. Panelists included Guy Coates from Sanger, Xing Xu from BGI, Eric Jones from the Broad Institute, and Alexander (Sasha) Zaranek from Harvard Medical School and a company called Clinical Future.

For those of us who remember when it was a big deal to have a terabyte of storage available, it was truly amazing to hear that most of the panelists have 15 petabytes or more of data stored and easily accessible. Still, even with resources like that, some of the panelists encourage their institute members to delete data when possible, such as the unaligned reads from a sequencing run.

Access control is a real problem for managing data at these large centers. Sanger’s Coates said that his institute’s move into the clinical field, complete with consent forms and all the other compliance needs, makes controlling access “a real nightmare” for his team. Jones at the Broad said that people in the field are essentially living on borrowed time, as it becomes increasingly important to find the right solution to this challenge. Zaranek noted that Clinical Future will use the Arvados tool to address the issue by attaching security permissions and provenance to the files themselves.

The panelists also discussed cloud computing specifically, with BGI’s Xu saying that the cloud is his center’s main data repository. One remaining goal, he said, is to enable faster and more efficient exchange of genomic data around the world through higher bandwidth. The center has tested this using Aspera and successfully transferred 24 GB in just 30 seconds across countries, but the feat is not yet economical enough for routine use.

Coates said that his group uses cloud options (including Amazon) for research projects, but they are still evaluating how to integrate the cloud into the production pipeline in a cost-effective way. At the Broad, Jones said, the need to move to the cloud is understood, but so far internal computing is still enough for institute members. He added that the cloud’s elasticity will ultimately drive adoption, allowing people to run very large jobs that would otherwise interfere with the rest of the institute’s compute resources. Zaranek said that his group is using cloud computing from both Harvard and Amazon and that having both options is incredibly valuable. This arrangement will also allow other organizations to access the group’s resources.

Coates and Jones said that the real challenge in managing data comes when individual researchers start moving data around, because tracking that data and predicting resource needs can become difficult.

These are all issues that we have given a great deal of thought to as we designed and built the new DNAnexus, now available for beta testing. We agree that security and compliance are important components of any compute solution — whether cloud-based or in-house — and that’s why we baked the highest standards right into our new tool. Having flexibility to configure the environment as needed, such as scaling up or down at a moment’s notice, is another key trait of the new platform and one that we believe will be quite useful for scientists in individual labs or at these major genome centers streaming data around the clock.