Addressing the Complex Storage and Archival Needs of DNA Sequencing Data


Computational biologist Michael Schatz, an expert in large-scale DNA computation, estimates that by 2025 we will amass between 100 million and 2 billion sequenced human genomes [1]. That’s a massive amount of data to use for the purpose of improving human health, but there’s a catch: we need to find creative solutions for storing these data. These solutions must be economical and must allow rapid retrieval of data when needed for analyses.

Typically, cloud providers such as Amazon Web Services (AWS) and Microsoft Azure use a tiered pricing structure. Frequently accessed data command a higher storage fee than infrequently accessed data placed in cloud archives, also known as cold storage.
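
To make the trade-off concrete, here is a minimal sketch of how a tiered structure can affect a monthly storage bill. The tier names and per-gigabyte prices are hypothetical placeholders chosen for illustration, not actual AWS or Azure rates, and the sketch ignores the retrieval and request fees that archive tiers commonly carry.

```python
# Hypothetical per-GB monthly prices in USD; these are NOT real provider rates.
HYPOTHETICAL_PRICE_PER_GB = {
    "hot": 0.020,   # frequently accessed data
    "cold": 0.004,  # archived, infrequently accessed data
}


def monthly_storage_cost(gb_by_tier):
    """Return the monthly cost in USD for a mapping of tier name -> gigabytes."""
    return sum(
        HYPOTHETICAL_PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items()
    )


# 100,000 GB kept entirely in the hot tier versus 90% moved to cold storage.
all_hot = monthly_storage_cost({"hot": 100_000})
mostly_cold = monthly_storage_cost({"hot": 10_000, "cold": 90_000})
print(f"all hot: ${all_hot:,.2f}  mostly cold: ${mostly_cold:,.2f}")
```

Under these assumed prices, moving rarely used data to the cold tier cuts the bill substantially, which is the motivation for archiving files that do not need frequent access.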

At DNAnexus, we are committed to supporting the sequencing community with creative solutions, which is why we are proud to announce the upcoming rollout of our new archival service.

The DNAnexus archival service provides a cost-effective and secure way to store files that do not need to be accessed frequently. More importantly, even though the files may be moved to cold storage, the DNAnexus Platform continuously maintains the data provenance and keeps the meta-information of those files, such as tags and property key-value pairs, searchable. With the DNAnexus archival service, you can locate files, whether they are archived or live, simply by querying their meta-information.

With this feature, you can archive individual files, folders, or entire projects. You can also easily unarchive one or more files, folders, or projects when you need to make the data available for further analyses.
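
The idea that archived files stay discoverable through their meta-information can be illustrated with a small in-memory model. This is only a sketch: the `FileRecord` class, the `set_state` and `find_files` helpers, and the `"live"`/`"archived"` state names are hypothetical and are not the DNAnexus API.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical stand-in for a platform file; not a DNAnexus type."""
    name: str
    folder: str
    state: str = "live"  # "live" or "archived" (assumed state names)
    tags: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def set_state(files, folder, state):
    """Archive or unarchive every file in `folder` or any of its subfolders."""
    changed = 0
    for f in files:
        in_folder = f.folder == folder or f.folder.startswith(folder + "/")
        if in_folder and f.state != state:
            f.state = state
            changed += 1
    return changed


def find_files(files, tag=None, **properties):
    """Search by tag and property values, regardless of archival state."""
    return [
        f for f in files
        if (tag is None or tag in f.tags)
        and all(f.properties.get(k) == v for k, v in properties.items())
    ]


files = [
    FileRecord("s1.bam", "/run1", tags={"wgs"}, properties={"sample": "S1"}),
    FileRecord("s2.bam", "/run1", tags={"wgs"}, properties={"sample": "S2"}),
    FileRecord("s3.vcf", "/run2", tags={"exome"}, properties={"sample": "S3"}),
]

archived_count = set_state(files, "/run1", "archived")  # archive a whole folder
hits = find_files(files, tag="wgs", sample="S1")        # metadata search still works
print(archived_count, [(f.name, f.state) for f in hits])
```

The design point is that the search never consults the storage state: moving a file to the archive changes where its bytes live, while its tags and properties remain in the queryable index.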

Currently, the DNAnexus archival service is available via the application programming interface (API) in AWS regions only, and you must have a license.

  • To learn more about archiving and unarchiving, click here.
  • To request a license to use this feature, contact

1. Fleishman G. The Data Storage Demands of Genome Sequencing Will Be Enormous. MIT Technology Review. Published October 26, 2015. Accessed October 3, 2019.

Scaling to Meet the Demands of Rare Disease Genetics


Approximately 350 million people worldwide face the life-changing impacts of a rare disease. The sheer number of people affected raises the question: are rare diseases really that rare? According to Jussi Paananen, Chief Technology Officer at Blueprint Genetics, perhaps the only thing rare about them is how difficult they are to diagnose: “it’s like finding a needle in a haystack.”


Blueprint Genetics wants the process of diagnosing and understanding rare diseases to be more efficient and less challenging for patients. The global clinical diagnostics company focuses on rare disease genetics and offers single-gene tests, panels, and whole exome sequencing in 14 medical specialties, including cardiology, ophthalmology, immunology, hereditary cancer, and malformations. The result of Blueprint’s end-to-end testing operation is an easy-to-understand clinical report that physicians can share with their patients.

There’s been a notable surge in rare disease activity over the last few years, driven by reduced sequencing costs and several public sector initiatives that have increased our understanding of these conditions. Blueprint experienced this surge firsthand and had to scale its operations rapidly to meet demand. When Paananen’s team added an Illumina NovaSeq™ 6000 system to its operations, the volume of samples increased, and so did the amount of data for each sample. By partnering with DNAnexus, Blueprint Genetics was able to scale in velocity and volume by moving its bioinformatics pipelines to the cloud.

After samples are sequenced on the NovaSeq system, the raw data now run through Blueprint’s proprietary bioinformatics pipelines on DNAnexus, which produce pre-processed variants that the in-house geneticists interpret and use as the basis for final clinical reports.


According to Paananen, the company has scaled to meet the increased demand, so the bottleneck is no longer sequencing and bioinformatics; it is interpretation. But Blueprint has a solution for this challenge as well: augmented intelligence. Augmented intelligence, in which human expertise remains at the center of decision making with help from computer counterparts, will enable geneticists at Blueprint to automate as much as possible to increase interpretation speed and quality.

The bottleneck in scaling clinical genetics is no longer in sequencing or bioinformatics, but in clinical interpretation of the data.

To learn more about how Blueprint Genetics has scaled its high-volume clinical diagnostics business, watch the video below:

Works Cited

1. Austin CP, Cutillo CM, Lau LPL, et al. Future of Rare Diseases Research 2017-2027: An IRDiRC Perspective. Clinical and Translational Science. Published January 2018. Accessed October 1, 2019.

DNAnexus Scoops European Innovations Award at SCOPE

We’ve always been proud of our innovative technology solutions, and recently we had the opportunity to celebrate our success with our European partners at the Summit for Clinical Ops Executives (SCOPE)-Europe in Barcelona, Spain. 

Preeti McGill, EMEA Account Director, was on hand to accept DNAnexus’ European Innovations Award, sponsored by Clinical Research News. Fellow awardees included the UK’s Guy’s and St Thomas’ NHS Foundation Trust and Tata Consultancy Services.

The European Innovations Awards program seeks to recognize outstanding examples of applied strategic innovation: partnerships, deployments, and collaborations that manifestly improve the clinical trial process.

“These awards celebrate dedication and innovation in clinical research, and the winners chosen highlight the inspiring work being done in the industry,” said Allison Proffitt, Editorial Director of Clinical Research News. “The research community in Europe is increasingly open, and the projects showcased in this year’s award program prove their dedication to excellence.” 

One project in particular, the UK Biobank Consortium Data Delivery and Cohort Browser, was noted for its contribution to the scientific community. Part of the DNAnexus Apollo Platform, the cohort browser was designed to democratize data access to the UK Biobank, which has collected and developed a biospecimen and data resource on over 500,000 individuals. 

In collaboration with a consortium of pharma companies, the Regeneron Genetics Center has undertaken the exome sequencing and analysis of all 500,000 samples, using Apollo to host the dataset and run Regeneron’s software pipeline. We partnered with Regeneron to build a combined, explorable database of the biobank’s genomic and phenotypic data, with an innovative “geno/pheno cohort browser” user interface that gives diverse teams the ability to browse and build cohorts among 3,000 phenotypic fields and 15,000,000 genomic variants across 100,000 samples.

“This resource has proven valuable to pharmaceutical companies, healthcare organizations and the scientific community, and has already led to more than 170 publications revealing novel associations with important epidemiological markers,” Clinical Research News noted.  

Data leads to new discoveries

The Sept. 18 award presentation came on the heels of another announcement: a £200 million expansion of the original £34 million pilot programme. The funding, half from government sources and half from biopharmaceutical and healthcare companies (Amgen, AstraZeneca, GlaxoSmithKline and Johnson & Johnson), will enable each of the biobank’s 500,000 samples to undergo whole genome sequencing at the Wellcome Sanger Institute in Cambridge, UK, and the deCODE site in Iceland.

“Today’s funding will support one of the world’s most ambitious gene sequencing programmes ever undertaken,” said UK Business Secretary Andrea Leadsom. “Its results could transform the field of genetic research – unlocking the causes of some of the most terrible diseases and how we can best tackle them. It will be a major step forward for individually tailored treatment plans, and will help us better understand why some people get certain diseases while others don’t.”

Data from Regeneron’s exome sequencing efforts have already led to several scientific discoveries. One is a better understanding of type 1 diabetes. Because it is the most common form of diabetes in children, it has been called ‘juvenile diabetes’. However, a study using information from UK Biobank revealed that type 1 diabetes is almost as common in adults as it is in children. This had not been recognized previously because so many adults have a different kind of diabetes. Another study shed light on the effect of diabetes on the heart.

Scientists in Australia used the data to examine the prevalence of inbreeding in the UK (~13,200 people born to extreme inbreeding), and scientists at the Broad Institute used it to explore whether there are any genetic links involved in same-sex partnerships. Another team of researchers used both sequencing and imaging data from the UK Biobank to characterize the brain signature and genetic basis of left-handedness. 

“I am incredibly excited by the potential of genomics to change the way we think about disease and healthcare,” said UK Health and Social Care Secretary Matt Hancock. “In an ageing society with an increasing burden of chronic diseases, it is vital that we diagnose earlier, personalise treatment and where possible prevent diseases from occurring altogether. This project will help unlock new treatments and grow our understanding of how genetics affects our risk of disease.”

Researchers interested in applying for access to UK Biobank data should visit

For more information about the UK Biobank Cohort Browser, visit