Diving Deep into the Data Ocean

At City of Hope, POSEIDON was developed to manage a vast data ocean: an informatics platform fed by so many streams of data that it has created a deep, diverse ecosystem supporting the dynamic precision medicine program at one of the country’s top comprehensive cancer centers.

City of Hope Informatics Platform

In a recent fireside chat with our Chief Revenue Officer Mark Swendsen, City of Hope Senior Vice President and Chief Informatics Officer Sorena Nadaf dove into the details of the massive informatics challenge and described why the California cancer center selected DNAnexus as its partner for the project.

A wide pool of patient information   

From real-world data to real-world evidence to real-world action, City of Hope’s vision for its precision oncology program is one of full integration between research and clinical care, pooling a wide variety of data from multiple sources, to be accessed by bioinformaticians and physicians alike. 

Each year, tens of thousands of patients pass through the doors of the main City of Hope campus in Duarte, outside Los Angeles, and its 31 clinical network locations throughout Southern California. Each cancer patient has their tumor genetically sequenced to help guide therapeutic options. In addition, the new system will integrate other data from the patient journey, from disease registries to pathologies, molecular characterization of the tumor, medical record data, and clinical trials information.

Among the requirements of the new platform:

SCALE: “It needed to scale because we were bursting at the seams across our own internal network.” 

COLLABORATION: “Building a network was also critical for us. We needed a way to not stay siloed in our ideas. In academic matrix institutions, we collaborate deeply, across disease programs.”

EASE OF USE: “We needed to make it useful for bioinformaticians who are very technical, but we also needed it easy for oncologists who just wanted a dashboard to look at analytics really quickly, or to be able to browse across various data sets that are specific to their disease.” 

SECURITY: “Data governance and security is a big part of this. It needs to be well managed across a matrix organization.”

Why DNAnexus?

Nadaf and his colleagues spent several months scoping out their precision oncology platform project and quickly realized they couldn’t do it alone. 

“You can’t build it yourself anymore,” Nadaf said. “The time has come and gone where you can throw everything into an Excel spreadsheet or a REDCap database, or buy an infrastructure that’s already built and try to squeeze everything into it.” 

“We really needed an environment that was fine-tuned and revolved around our workflow. And we needed the right experts to help us strategize. Once we ingest all this data, how do we provide this as a solution, primarily for City of Hope, but also our growing network across the landscape of precision medicine institutions?”

Nadaf said he was familiar with DNAnexus thanks to previous work together via the Cancer Informatics for Cancer Centers (Ci4CC).

“It’s been a really good symbiotic relationship for us,” Nadaf said. “I think this partnership is extremely unique and I really do believe that we are onto something very, very special.”

Enabling super science and patient care

The platform is not just a static repository of data, Nadaf said. It’s been an “enabler.” It’s helped inform City of Hope’s unique in-house drug development. It’s helped place patients into clinical trials. It’s been a huge help for tumor boards, where clinicians, researchers, and technical curators come together to make decisions on tricky cases. And it has led to new research ideas, new methods, and new translational projects.

It has already proved its value through savings in time and energy.

“I think it’s helped us really catapult our initiative a few years, to say the least,” Nadaf said. 

It’s also changed expectations, and opened up new possibilities. 

“It’s no longer a question of can we do this? It’s a matter of, this is exactly what we need, let’s find how to expand our platform to support it.”

Learn more about the partnership here.

DNAnexus R&D Report: Benchmarking Germline Variant Calling Pipelines

Germline Variant Calling

Variant calling is a staple of genome analysis. Developments in computational methods and algorithms have produced a plethora of programs and pipelines for calling germline variants in both whole genome and whole exome sequence data. With so many options to choose from, picking a pipeline for an analysis project can be challenging.

DNAnexus researchers are helping to narrow that search by benchmarking the performance of popular variant calling pipelines, determining the factors that can affect performance, and matching appropriate pipelines to specific scientific needs. They describe the results of comparing six of these well-known pipelines in a recent report. The report includes detailed data on the algorithm versions and compute instance types, as well as data on runtimes and the accuracy of the variant calls. The team used variant truth sets from the Genome in a Bottle (GIAB) Consortium as the reference for the analysis. This dataset offers high-confidence SNPs, INDELs, and homozygous reference regions generated using data from seven human genomes.

To test the pipelines, the research team used whole exome and whole genome sequence reads from NIST. Specifically, they downloaded five whole exome and seven whole genome samples. For the variant truth set, the researchers used high-confidence regions from version 3.3.2 of the NIST dataset. For the pipelines themselves, the team used four off-the-shelf applications from the DNAnexus tools library, which offers containerized software tools and their associated dependencies for easy implementation in compute environments, and packaged two additional apps for the comparison tests. The researchers processed each sample from raw reads through to variant calls using default pipeline parameters and settings, and assessed the accuracy of small germline variant calls using standards set by the Global Alliance for Genomics and Health (GA4GH). Each of the six pipelines was run on a single AWS instance.
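As a concrete sketch of the comparison step, the snippet below assembles an invocation of hap.py, the GA4GH-recommended small-variant benchmarking tool, against a GIAB truth set. All file paths and the output prefix are hypothetical placeholders, and the command is only constructed here, not executed:

```python
# Sketch: assemble a GA4GH-style benchmarking command using hap.py.
# All file paths below are hypothetical placeholders.
truth_vcf = "HG001_GIAB_v3.3.2_truth.vcf.gz"        # GIAB truth set
query_vcf = "sample_pipeline_calls.vcf.gz"          # pipeline output under test
confident_bed = "HG001_GIAB_v3.3.2_confident.bed"   # high-confidence regions
reference = "GRCh38.fa"

happy_cmd = [
    "hap.py",
    truth_vcf,
    query_vcf,
    "-f", confident_bed,                 # restrict scoring to confident regions
    "-r", reference,
    "-o", "benchmark/sample_vs_truth",   # output prefix (hypothetical)
]

# Joined for display; in practice you would run it with subprocess.run(happy_cmd).
print(" ".join(happy_cmd))
```

hap.py then emits per-type (SNP/INDEL) precision and recall tables, which is the form of output the comparisons below are based on.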

| Label | Software | Version | AWS Instance Type+ | Variant Calling Algorithm | DNAnexus App |
| --- | --- | --- | --- | --- | --- |
| gatk4 | BWA-MEM + GATK | 0.7.17-r1188 + gatk-4.1.4.1 | c5d.18xlarge | GATK HaplotypeCaller | Upon request |
| parabricks_deepvariant | Parabricks Pipelines DeepVariant | v3.0.0_2 | g4dn.12xlarge | DeepVariant | pbdeepvariant |
| parabricks_germline | Parabricks Pipelines Germline | v3.0.0_1 | g4dn.12xlarge | GATK HaplotypeCaller | pbgermline |
| sentieon_dnascope | Sentieon (DNAscope) | sentieon_release_201911 | c5d.18xlarge | DNAscope | sentieon_fastq_to_vcf |
| sentieon_haplotyper | Sentieon (Haplotyper) | sentieon_release_201911 | c5d.18xlarge | GATK HaplotypeCaller | sentieon_fastq_to_vcf |
| strelka2 | Strelka2 | 2.9.10 | c5d.18xlarge | Strelka2 | strelka2_germline |

Table 1: Germline variant calling software. +We intentionally selected instance types with similar AWS hourly rates.

In general, the pipelines performed comparably, calling SNPs and INDELs in both the whole-exome and whole-genome samples with over 99% recall and precision. Some variation between pipelines is likely due to limitations of the sample data collection methods, as well as the reference build used. The DNAnexus researchers called variants against both the GRCh38 and hs37d5 builds. The GRCh38 build is more complete but contains many repetitive sequences in some genomic regions, which pose problems for pipelines and result in more false-negative and false-positive calls. Lastly, as noted above, this test used default pipeline parameters. In practice, researchers tweak these depending on the genomic regions or loci they are studying, and these changes influence a pipeline’s precision and recall.
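For reference, the GA4GH small-variant metrics behind those figures reduce to simple ratios over true-positive (TP), false-positive (FP), and false-negative (FN) calls. A minimal sketch, using made-up counts rather than the report's measurements:

```python
# GA4GH small-variant benchmarking metrics from a confusion count:
#   recall    = TP / (TP + FN)   -- fraction of truth-set variants recovered
#   precision = TP / (TP + FP)   -- fraction of calls that match the truth set
# The counts below are illustrative, not taken from the report.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for a set of variant calls."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Example: 3,000,000 true-positive SNP calls, 20,000 FPs, 25,000 FNs.
p, r = precision_recall(tp=3_000_000, fp=20_000, fn=25_000)
print(f"precision={p:.4f} recall={r:.4f}")  # both above 0.99
```

Even tens of thousands of erroneous calls still leave both metrics above 99% at whole-genome scale, which is why small differences between pipelines matter mainly in the hard regions discussed above.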

The runtimes were also comparable, with most pipelines performing notably faster than GATK4. In practice, these runtimes will vary with the hardware, the parameters, and the efficiency of the algorithms used. The figures reported here will also likely change as developers update their software and reference datasets continue to evolve and improve; GIAB has already updated at least one of the truth datasets used in this study. DNAnexus researchers plan to publish a follow-up report with updated information and analysis.
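Because the instance types in Table 1 were chosen for similar hourly rates, wall-clock runtime translates almost directly into cost. A hedged sketch of that arithmetic, where the rates and runtimes are illustrative placeholders rather than the report's measurements:

```python
# Rough cost model for a benchmarking run: cost = runtime_hours * hourly_rate.
# Labels, runtimes, and hourly rates below are illustrative placeholders,
# not measured values from the report.
runs = {
    # label: (runtime_hours, hourly_rate_usd)
    "pipeline_a": (2.0, 4.0),
    "pipeline_b": (1.5, 4.2),
}

costs = {label: hours * rate for label, (hours, rate) in runs.items()}
for label, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{label}: ${cost:.2f}")
```

With rates held roughly equal, the faster pipeline is also the cheaper one, which is the point of normalizing the instance choice before comparing runtimes.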

As with many things, the best pipeline for an analysis project comes down to research needs and available resources. The pipelines used in this report represent the current state of the art and provide a useful starting point for making decisions about variant calling infrastructure. If you have a particular use case you would like to discuss, please reach out. We would be happy to talk with you.

UK Biobank Democratizes Data Access with its own Cloud-Based Data Analysis Platform

UK Biobank Data Analysis Platform

Among the many lessons learned from the COVID-19 pandemic is that the world remains dangerously exposed to novel and unknown health risks, despite exponential developments in our ability to improve and prolong life. Access to the right data could help scientists develop faster responses not only to global pandemics, but also long-term improvements in patient outcomes and quality of life for millions suffering from debilitating diseases.

That is why we are happy to collaborate with UK Biobank and Amazon Web Services to unleash the transformative potential of this unparalleled, diverse dataset containing phenotypic, genomic, and imaging data from 500,000 volunteers. UK Biobank has enlisted the services of DNAnexus to help develop its own cloud-based Data Analysis Platform. Funding support has come from Wellcome.

As announced earlier this week, the UK Biobank Data Analysis Platform will enable its extensive data resource to be accessed by a far broader range of researchers. The Data Analysis Platform will undergo development and testing through the first half of 2021, with plans to launch to all researchers by summer 2021. 

Over the next five years, UK Biobank data will grow to 15 petabytes — equivalent to the amount of data created annually by the Large Hadron Collider. By developing its own cloud-based Data Analysis Platform, UK Biobank can make the data more easily, securely, and cost-effectively accessible to approved researchers around the world. To support this work, AWS has offered UK Biobank computational credits, to be awarded as grants by the institution to students and researchers in low- and middle-income countries.

“This new platform will democratise access, helping us to unleash the intellects of the world’s best scientific minds – wherever they are – to make discoveries that improve human health.”

Professor Sir Rory Collins, UK Biobank Principal Investigator

DNAnexus has previously worked with UK Biobank and the Regeneron Genetics Center to deliver exome sequencing results back to the UK Biobank Exome Sequencing Consortium (UK-ESC) members, spearheaded by those two organizations. In collaboration with the RGC, DNAnexus also built a web application to enhance the data exploration experience with interactive visualization, filtering, and browsing of integrated phenotypic and genomic information. Designed to democratize data access, the cohort browser allows diverse teams the ability to explore thousands of phenotype fields and millions of genomic variants to rapidly test multiple hypotheses and gain insight into mechanisms of action, biomarkers, and drug targets. The browser was deployed using DNAnexus Apollo, a multi-omics data science platform, which is optimized for large-scale genotype-phenotype datasets such as UK Biobank, TCGA, and other public and proprietary datasets.

This partnership is just one important step towards a world in which data sharing and accessibility are standard. The analysis of UK Biobank data has provided, and will continue to provide, a significant contribution to leading-edge developments. We enthusiastically support the foundational UK Biobank project as it breaks new ground in the advancement of disease research through the integration of deep healthcare data with genomics and advanced tools.