What Will It Take to Make Big Data in Biomedicine a Success?

Last week, we attended the Big Data in Biomedicine conference, held in our backyard at Stanford University. The meeting, hosted by the Stanford School of Medicine and the University of Oxford, pulled in 60 speakers and panelists who covered a broad range of topics, including infectious disease, wearable devices, statistics, machine learning, and one of our favorite topics, integrating genome-scale data.

The panel on integrating genome-scale data was moderated by Stanford’s Carlos Bustamante and included panelists Alexis Battle (Johns Hopkins), Robert Gentleman (Genentech), Daniel MacArthur (Harvard), and Gilean McVean (Oxford). One of the more thought-provoking comments came from Gentleman, who suggested that to be successful we need to (1) make data small and fast and (2) make software fast and maintainable. Music to our ears.

In many ways, we have already solved the core issues of scale when it comes to raw compute and storage. Advanced cloud technologies have made a virtually limitless number of processors and hard drives cost-effectively available to anyone with an internet connection. Here at DNAnexus, we’ve worked on projects that have spun up more than 21,000 cores on demand and used approximately 3.3 million core-hours of compute time, generating 430 TB of results and requiring nearly 1 PB of data storage.

However, raw processing and storage resources are just the beginning. To Gentleman’s points, we also need ways to generate datasets that are smaller and more informative. We need to be able to move them around quickly and make them available for collaboration and investigation, using interpretation tools that keep results reproducible across labs and institutions.

We are working every day at DNAnexus to make these goals a reality. The DNAnexus platform enables researchers to efficiently develop and standardize pipelines, incorporating in-house and industry-recognized tools and reference resources. It offers all of this in a secure, compliant, and cost-effective environment. As data is uploaded to the platform, quality analysis protocols are applied consistently across remote sites using a common pipeline. Data, once uploaded, is immutable, and tools are tightly version-controlled, ensuring reproducibility. Post-analysis data is easily shared with global collaborators via the cloud, with access controlled by a project administrator. All data access is tracked and logged, providing a detailed chain of provenance.
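To make the reproducibility idea concrete, here is a minimal Python sketch of how immutable inputs and pinned tool versions can combine into a provenance record. It is an illustration only, not the DNAnexus API: the names `ProvenanceRecord`, `describe_run`, and `run_id`, and the use of SHA-256 digests, are assumptions we chose for the example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    """Hypothetical record tying a result to its exact input and tool version."""
    input_sha256: str
    tool_name: str
    tool_version: str

    def run_id(self) -> str:
        # Deterministic identifier: the same input digest and tool version
        # always produce the same ID, so identical runs can be recognized.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def describe_run(data: bytes, tool_name: str, tool_version: str) -> ProvenanceRecord:
    # Fingerprint the input by content rather than by file name, so any
    # change to the data yields a different record.
    return ProvenanceRecord(hashlib.sha256(data).hexdigest(), tool_name, tool_version)
```

Under these assumptions, two labs that run the same tool version on byte-identical data get the same `run_id`, while a change to either the data or the tool version yields a different one.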

If you are looking for new ways to better manage your pipelines, accelerate your software, and make your data fast, we’d love to talk to you about how the DNAnexus platform can help. We have learned a lot from our customers at biopharmaceutical companies, diagnostic test providers, genome centers, and sequencing service providers and can apply this experience to help you integrate your genome-scale data and accelerate your science.