The Rising Tide of Genomic Data Points to the Cloud

No other market segment has felt the profound impact of the cloud more than the life sciences industry. In March, a major roadblock was eliminated when the National Institutes of Health lifted its ban on the use of government datasets (dbGaP, TCGA, etc.) in the cloud and updated its security best practices white paper. In the past, individual researchers would download data hosted in a variety of locations, attempt to integrate their own data, and run analyses on their local hardware, a time-consuming endeavor fraught with headaches. This approach has become unsustainable given that data sizes have grown exponentially as the cost of genome sequencing has been driven down. There is now a collective push within the genomics research community to create a cloud commons, something in which DNAnexus wholeheartedly believes.

Just how much data are we talking about? According to recent research, Earth contains around 5.3 x 10^37 DNA base pairs. The authors add: “By analogy, it would require 10^21 computers with the mean storage capacity of the world’s four most powerful supercomputers (Tianhe-2, Titan, Sequoia, and K computer) to store this information.”
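To get a feel for those numbers, here is a rough back-of-envelope sketch. It assumes each base is stored in 2 bits (enough to distinguish A, C, G, and T); that encoding is our assumption for illustration and may not be what the researchers used, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope check of the storage analogy quoted above.
BASE_PAIRS = 5.3e37    # estimate of DNA base pairs on Earth, as quoted in the post
COMPUTERS = 1e21       # number of computers in the quoted analogy
BITS_PER_BASE = 2      # ASSUMPTION: 2 bits per base (A, C, G, T), not taken from the research


def bytes_per_computer(base_pairs=BASE_PAIRS,
                       computers=COMPUTERS,
                       bits_per_base=BITS_PER_BASE):
    """Return the bytes each computer would need to hold under the stated assumptions."""
    total_bytes = base_pairs * bits_per_base / 8
    return total_bytes / computers


per_machine = bytes_per_computer()
# Convert bytes to petabytes (1 PB = 1e15 bytes) for readability.
print(f"{per_machine / 1e15:.1f} PB per computer")
```

Under that assumption, the quoted figures work out to storage on the order of ten petabytes per machine.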

Fortunately, no one has been asked to manage the DNA for our entire planet’s biomass, but the research points to a very real challenge. With this rising tide of data comes the need for more computational resources, and the question of whether to build or buy infrastructure comes into play. A recent article in The Platform takes a fascinating dive into how the genomics community is utilizing life sciences clouds. In her article, author Nicole Hemsoth (@NicoleHemsoth) raises the issue that “what life sciences companies are missing is a management system for dealing with petabytes of data and billions of objects.”

She goes on to discuss how, as data management, storage, compute, security, and compliance features become more sophisticated and hardened, bursting into the cloud is the most efficient way to utilize local and cloud resources. While most large-scale genome centers have their own local infrastructure, their workloads tend to occur in spikes. To avoid overprovisioning, genome centers are finding their sweet spot by combining in-house infrastructure with bursting into the cloud. And with the advent of genomics cloud solutions such as DNAnexus, there are ways to seamlessly integrate those workloads with the cloud.
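The bursting idea can be illustrated with a minimal sketch. The weekly job counts, the local capacity, and the function name below are all hypothetical, chosen only to show the shape of the trade-off; this is not how DNAnexus or any particular genome center schedules work.

```python
def split_workload(jobs_per_week, local_capacity):
    """Simple burst policy: fill local capacity first, send the overflow to the cloud.

    Returns a list of (local_jobs, cloud_jobs) tuples, one per week.
    """
    plan = []
    for demand in jobs_per_week:
        local = min(demand, local_capacity)
        cloud = demand - local
        plan.append((local, cloud))
    return plan


# HYPOTHETICAL spiky demand: mostly quiet weeks with two large spikes.
weekly_demand = [400, 450, 1800, 500, 2200, 420]
plan = split_workload(weekly_demand, local_capacity=600)

# If the center instead bought enough hardware for the biggest spike,
# this is the fraction of that hardware it would actually use on average.
peak = max(weekly_demand)
utilization_if_peak_provisioned = sum(weekly_demand) / (peak * len(weekly_demand))

for week, (local, cloud) in enumerate(plan, start=1):
    print(f"week {week}: {local} local, {cloud} cloud")
print(f"utilization if sized for peak: {utilization_if_peak_provisioned:.0%}")
```

With these made-up numbers, hardware sized for the peak would sit idle more than half the time, which is the overprovisioning problem that bursting is meant to avoid.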

Another notable trend we’re seeing is how life science companies like Regeneron Genetics Center are moving to a 100% cloud-based solution. Instead of the heavy investment that comes with managing and maintaining IT infrastructure, companies can invest in intellectual property, shifting their focus to R&D to accelerate medical discovery.

While freeing up bandwidth on building and maintaining local hardware is a big appeal for the cloud, the real reason institutions decide to go with DNAnexus is for its genomics platform’s state-of-the-art compliance and security measures. While it’s true Amazon Web Services offers a lot of built-in features to ensure security and privacy and potentially any skilled engineer can go out and spin up Amazon EC2 instances themselves, when handling personal health information it’s just not that simple. DNAnexus has invested a tremendous amount of resources in creating a genomics platform that complies with ISO 27001 international security standards and provides data provenance to certify all operations can be tracked and reported for up to 6 years.

Just as there isn’t one way to genotype, there isn’t one way to take your data to the cloud. The field is constantly evolving, which means you’re constantly doing variant call bake-offs, working with many different tools to assess whether you are getting the correct variants of interest. The important question to ask is how will you manage all these diverse data and research requirements? Do you want to do it yourself or go with a proven genomics platform that offers a complete set of systems already in place to control and manage your data? DNAnexus can help. Drop us a line when you’re ready.

NIH Security Best Practices Update

DNAnexus has always taken a proactive approach to security and compliance. We’ve worked closely in partnership with AWS (Amazon Web Services) to provide our mutual customers best-in-class security for genome informatics and data management in the cloud. These efforts have been acknowledged in the updated publication of the National Institutes of Health Genomic Data Sharing Policies. The updated guidelines make it clear that researchers may use AWS and DNAnexus to store and analyze controlled-access genomic data, including dbGaP and TCGA.

Prior to this policy change, NIH guidance would not allow use of commercial cloud computing services for work involving controlled-access genomic data, which, though it has been stripped of identifying information, may be unique to individuals. With the new NIH policy, such data can be used in the cloud after investigators obtain project-specific approval for its use.

Applications for such approval must include a data security plan. At DNAnexus, we were pleased to discover that a publication from our white paper library detailing our own security compliance practices was listed as an information resource for individuals seeking a working understanding of the requirements.

DNAnexus platform features such as two-factor authentication, end-to-end encryption, need-based network access control, 24/7 security monitoring and updates, and audit and access logging allow us to satisfy the new requirements and to exceed the security of many local datacenter installations.

DNAnexus is working with AWS and the NIH to establish mechanisms for processing data access requests and vending the access-controlled data to approved requestors, and we expect to be providing seamless access to these important datasets in DNAnexus.

The combination of data security and powerful computing made possible by the partnership between DNAnexus and AWS has created the ideal platform for global scientific collaboration in genomics. We are thrilled now to be witnessing the beginnings of a genomic discovery “Commons,” where data are brought together and analyzed by researchers around the world.  We will continue to meet and exceed existing security standards, working with our partners to enable new kinds of innovation driven by genomic big data.

 

100% Cloud-based Genome Center Integrating Large Healthcare Data Flows

photo: The Cancer Genome Atlas

In a previous post, our new CMO, David Shaywitz, talked about his vision for DNAnexus and its role in helping fulfill the promise of genomic medicine:

“DNAnexus represents a natural home for these aspirations, offering a compelling, secure, cloud-based data management platform, an enabling tool for any healthcare organization – academic medical center, healthcare system, biopharma company, payor – who recognizes that getting a handle on large healthcare data flows is rapidly becoming table stakes, and that figuring out how to manage and leverage genomic data is a wise place to start.”

Fast-forward two months…  This week, we announced exciting progress in our efforts to accelerate genomic medicine.  The DNAnexus cloud-based genome informatics and data management platform is powering a number of collaborations between Regeneron Genetics Center (RGC) and its leading healthcare provider partners.

In an RGC press release, the company announced these new collaborators, which include the Geisinger Health System, Columbia University Medical Center, Clinic for Special Children, and Baylor College of Medicine. The RGC will be using the DNAnexus platform to integrate sequencing data with de-identified clinical records from patient volunteers. To date, the RGC has sequenced samples from more than 10,000 individuals and is currently sequencing more than 50,000 samples per year.

The Geisinger collaboration, which has been described as the largest clinical sequencing project in the U.S., is on track to sequence more than 100,000 patient volunteer samples. This DNAnexus-powered initiative has resulted in the first 100% cloud-based biopharma genome center, and is now operating at scale.

Next-generation sequencing technologies, like Illumina’s HiSeq 2500 or X Ten platform, have reduced the cost and increased the speed of DNA sequencing, outpacing Moore’s Law to the point where the new bottleneck is genome informatics. To address this issue, companies like Regeneron are adopting cloud-based solutions to handle the massive volume of sequencing data.

DNAnexus provides the technology backbone that enables the sharing and management of data and tools around large volumes of sequencing data between the RGC and its healthcare collaborators. Currently the RGC is processing more than 1,000 exomes per week and sharing the data easily and safely with their collaborators.

In order to improve patient care and ultimately human health, the integration of genomic and phenotypic data needs to happen on a massive scale (something David has recently discussed from the perspective of phenotype). Combining large cohorts of deeply-phenotyped individuals with their genomic data offers a wide range of medical applications, the most obvious being a more personalized approach to medical interventions, such as determining which therapy might work best for a given individual. These data can also be used to aid in the development of new companion diagnostics and clinical trial participant selection. As an article in GigaOM put it this week: Cloud Computing is Coming for Your DNA, and it Will Lead to Better Drugs and Health Care.

These collaborations are powerful examples of how the DNAnexus platform is enabling an integrated approach between biopharmaceutical companies and their partners to accelerate the research and discovery process. As David said, healthcare industry leaders who prioritize the management of large healthcare data flows will emerge as the pioneers who help us realize the full vision of precision medicine: delivery of the optimal therapy to the right patients at the right time, ideally before they are sick.