The Rising Tide of Genomic Data Points to the Cloud

No other market segment has felt the impact of the cloud more profoundly than the life sciences industry. In March, a major roadblock was eliminated when the National Institutes of Health lifted its ban on the use of government datasets (dbGaP, TCGA, etc.) in the cloud and updated its security best practices white paper. In the past, individual researchers would download data hosted in a variety of locations, attempt to integrate their own data, and run analyses on their local hardware, a time-consuming endeavor fraught with headaches. This approach has become unsustainable given that data sizes have grown exponentially as the cost of genome sequencing has been driven down. There is now a collective push within the genomics research community to create a cloud commons, something in which DNAnexus wholeheartedly believes.

Just how much data are we talking about? According to recent research, Earth contains around 5.3 x 10^37 DNA base pairs. The researchers add: “By analogy, it would require 10^21 computers with the mean storage capacity of the world’s four most powerful supercomputers (Tianhe-2, Titan, Sequoia, and K computer) to store this information.”
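To get a feel for that analogy, here is a rough back-of-the-envelope sketch. It takes the two figures quoted above (5.3 x 10^37 base pairs and 10^21 computers) and assumes, purely for illustration, that each base pair is packed into 2 bits. That encoding is our assumption, not something stated in the research, and the function name is made up for this example.

```python
# Back-of-the-envelope check of the storage analogy quoted above.
BASE_PAIRS = 5.3e37        # figure quoted in the article
NUM_COMPUTERS = 1e21       # figure quoted in the article
BITS_PER_BASE_PAIR = 2     # ASSUMPTION: A/C/G/T packed into 2 bits each


def bytes_per_computer(base_pairs=BASE_PAIRS,
                       computers=NUM_COMPUTERS,
                       bits_per_bp=BITS_PER_BASE_PAIR):
    """Return the storage (in bytes) each computer would need to hold."""
    total_bytes = base_pairs * bits_per_bp / 8  # 8 bits per byte
    return total_bytes / computers


if __name__ == "__main__":
    petabytes = bytes_per_computer() / 1e15  # 1 PB = 10^15 bytes
    print(f"About {petabytes:.0f} PB per computer")
```

Under that 2-bit assumption, each of the 10^21 machines would need to hold on the order of 13 petabytes, which is the scale of storage you would expect from a top supercomputer. A less compact encoding would push the per-machine number higher.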

Fortunately, no one has been asked to manage the DNA for our entire planet’s biomass, but the research points to a very real challenge. With this rising tide of data comes the need for more computational resources, and the question of whether to build or buy infrastructure comes into play. A recent article in The Platform takes a fascinating dive into how the genomics community is using life sciences clouds. In her article, author Nicole Hemsoth (@NicoleHemsoth) observes that “what life sciences companies are missing is a management system for dealing with petabytes of data and billions of objects.”

She goes on to discuss how, as data management, storage, compute, security, and compliance features become more sophisticated and hardened, bursting into the cloud is the most efficient way to use local and cloud resources together. While most large-scale genome centers have their own local infrastructure, their workloads tend to come in spikes. To avoid overprovisioning, genome centers are finding their sweet spot by combining in-house infrastructure with bursting into the cloud. And with the advent of genomics cloud solutions such as DNAnexus, there are ways to seamlessly extend workloads into the cloud.

Another notable trend we’re seeing is life science companies like Regeneron Genetics Center moving to a 100% cloud-based solution. Instead of making the heavy investment that comes with managing and maintaining IT infrastructure, companies can invest in intellectual property, shifting their focus to R&D to accelerate medical discovery.

While freeing up the time and resources spent building and maintaining local hardware is a big appeal of the cloud, the real reason institutions decide to go with DNAnexus is its genomics platform’s state-of-the-art compliance and security measures. It’s true that Amazon Web Services offers a lot of built-in features to ensure security and privacy, and potentially any skilled engineer can go out and spin up Amazon EC2 instances themselves. But when handling personal health information, it’s just not that simple. DNAnexus has invested a tremendous amount of resources in creating a genomics platform that complies with ISO 27001 international security standards and provides data provenance, so that all operations can be tracked and reported for up to 6 years.

Just as there isn’t one way to genotype, there isn’t one way to take your data to the cloud. The field is constantly evolving, which means you’re constantly doing variant call bake-offs, working with many different tools to assess whether you are getting the correct variants of interest. The important question to ask is how you will manage all these diverse data and research requirements. Do you want to do it yourself, or go with a proven genomics platform that offers a complete set of systems already in place to control and manage your data? DNAnexus can help. Drop us a line when you’re ready.

NIH Security Best Practices Update

DNAnexus has always taken a proactive approach to security and compliance. We’ve worked closely in partnership with AWS (Amazon Web Services) to provide our mutual customers best-in-class security for genome informatics and data management in the cloud. These efforts have been acknowledged in the updated publication of the National Institutes of Health Genomic Data Sharing Policies. The updated guidelines make it clear that researchers may use AWS and DNAnexus to store and analyze controlled-access genomic data, including dbGaP and TCGA.

Prior to this policy change, NIH guidance would not allow use of commercial cloud computing services for work involving controlled-access genomic data, which, though it has been stripped of identifying information, may be unique to individuals. With the new NIH policy, such data can be used in the cloud after investigators obtain project-specific approval for its use.

Applications for such approval must include a data security plan. At DNAnexus, we were pleased to discover that a publication from our white paper library detailing our own security compliance practices was listed as an information resource for individuals seeking a working understanding of the requirements.

DNAnexus platform features such as two-factor authentication, end-to-end encryption, need-based network access control, 24/7 security monitoring and updates, and audit and access logging allow us to satisfy the new requirements and to exceed the security of many local datacenter installations.

DNAnexus is working with AWS and the NIH to establish mechanisms for processing data access requests and vending the access-controlled data to approved requestors, and we expect to be providing seamless access to these important datasets in DNAnexus.

The combination of data security and powerful computing made possible by the partnership between DNAnexus and AWS has created the ideal platform for global scientific collaboration in genomics. We are thrilled to be witnessing the beginnings of a genomic discovery “Commons,” where data are brought together and analyzed by researchers around the world. We will continue to meet and exceed existing security standards, working with our partners to enable new kinds of innovation driven by genomic big data.