Introducing htsget, a new GA4GH protocol for genomic data delivery

DNAnexus is here in Orlando for the fifth plenary meeting of the Global Alliance for Genomics and Health (GA4GH), the standards-making body advancing interoperability and data sharing for genomic medicine. We’re especially pleased this year to join in launching version 1.0 of htsget, a new protocol for the secure web delivery of large genomic datasets, especially whole-genome sequencing reads, which can exceed 100 gigabytes per person.

Htsget complements the incumbent BAM and CRAM file formats for reads, which GA4GH also stewards, and their ecosystem of tools. It adds a standardized protocol for accessing such data over the web securely, reliably, efficiently, and even in a federated fashion when needed. Retrieval with htsget is now built into the ubiquitous samtools via its underlying htslib library, allowing bioinformaticians to leverage htsget with most existing tools via a familiar Unix pipe. At the same time, htsget’s streaming parallelism enables scalable ETL into cluster environments like Apache Spark, providing a gradual transition path from incumbent file-based toolchains toward modern “big data” platforms. Lastly, htsget simplifies data access for interactive genome browsers by unifying authentication and removing the need for index files.

On the server side, htsget has been deployed at the Sanger Institute and the European Genome-phenome Archive; DNAnexus operates a multi-cloud htsget server, which we call htsnexus, indexing data within Amazon S3 and Azure Blob storage; and Google Cloud Platform has open-sourced its own implementation. Clients can speak a uniform protocol abstracting the diverse authentication and storage schemes of these service providers.
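
As a rough illustration of how one client can work across providers, here is a minimal Python sketch of the client side, assuming the request and "ticket" layout described in the htsget 1.0 specification: the client asks for a reads dataset (optionally restricted to a genomic range), the server replies with a JSON ticket listing URLs plus any headers each URL needs, and the client concatenates the bytes from those URLs in order. The server address, dataset ID, and function names are hypothetical, and the demo ticket uses inline data: URLs so the sketch runs without a network.

```python
import base64
import urllib.parse
import urllib.request


def ticket_url(base_url, read_id, reference_name=None, start=None, end=None, fmt="BAM"):
    """Build the URL that asks an htsget server for a ticket (hypothetical helper)."""
    params = {"format": fmt}
    if reference_name is not None:
        params["referenceName"] = reference_name
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/reads/{urllib.parse.quote(read_id)}?{query}"


def fetch_blocks(ticket):
    """Yield each data block named in the ticket, in the order the server listed them."""
    for entry in ticket["htsget"]["urls"]:
        # Each URL may carry its own headers (for example, an auth token),
        # which is how the ticket hides provider-specific access details.
        request = urllib.request.Request(entry["url"], headers=entry.get("headers", {}))
        with urllib.request.urlopen(request) as response:
            yield response.read()


def assemble(ticket):
    """Concatenate all blocks into the final byte stream (a BAM or CRAM in practice)."""
    return b"".join(fetch_blocks(ticket))


def _inline(payload):
    """Wrap bytes in a data: URL so the demo needs no server."""
    return "data:application/octet-stream;base64," + base64.b64encode(payload).decode()


# Stand-in for a server response; real tickets usually point at https:// URLs.
DEMO_TICKET = {
    "htsget": {
        "format": "BAM",
        "urls": [{"url": _inline(b"header-")}, {"url": _inline(b"records")}],
    }
}

print(ticket_url("https://htsget.example.org", "NA12878", "chr1", 100, 200))
print(assemble(DEMO_TICKET))
```

Because the ticket spells out where each block lives and which headers to send, the same client loop works whether the bytes come from object storage, a local archive, or inline data.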

These groups, and others, have all shaped the htsget specification through the GA4GH’s highly collaborative process. But it started in large part with a contribution from DNAnexus, drawing on our experience optimizing how our systems utilize cloud object stores in the huge genome projects we’ve served, such as CHARGE, 1000 Genomes Project, TCGA, and HiSeq X Series data production. Through htsget and other work streams under the new GA4GH Connect framework announced today, DNAnexus looks forward to further contributing from our experience and network to advance the GA4GH’s essential mission.

For more information about how DNAnexus is working with htsget, please contact us at info@dnanexus.com.

DNAnexus at ASHG: Accelerating Your Path from Genomic Data to Insight

We are looking forward to attending the annual American Society of Human Genetics (ASHG) meeting next week in Orlando.  

We’re excited to share updates on recent projects, including our new data analysis and management solution for NovaSeq™ instruments, our collaborative microbiome informatics platform, and the latest software tools available on DNAnexus from our partners at Edico Genome and PacBio.

If you’re headed to ASHG, stop by DNAnexus booth 811 to learn about the broad research and clinical applications of the DNAnexus Platform. Can’t make it to any of our events? Visit the booth anytime during the conference, or email us to schedule a meeting with a member of our team.

Lunchtime Talk 

Optimize Your Path to Variant Production: Real World Examples

Friday, October 20th, 1:00pm-2:15pm
Hilton Orlando Hotel, Lake George Room, Lobby Level

Join our lunchtime discussion to learn about DNAnexus CloudSeq, a powerful solution for rapidly scaling cloud-enabled bioinformatics infrastructure for research and clinical sequencing applications. You will hear case studies from Baylor College of Medicine’s Human Genome Sequencing Center and Rady Children’s Hospital about navigating the complexities of integrating large multi-omic datasets, and developing pipelines to analyze and share data and insights across global R&D organizations.

Guest Speakers:

  • Will Salerno, PhD, Director of Genome Informatics, Human Genome Sequencing Center at Baylor College of Medicine
    • Talk: Translation of NIH Data in Discovery Commons
  • Narayanan Veeraraghavan, PhD, Director of IT at Rady Children’s Institute for Genomic Medicine
    • Talk: Creating a Critical Nexus: Making Rapid Whole-Genome (rWGS) Based Precision Medicine Accessible to NICUs and PICUs Across the Country

RSVP today; lunch will be provided.

Booth Activities

Debuting DNAnexus CloudSeq
Stop by to learn about our powerful data analysis and management solution for the NovaSeq™ series of sequencing systems.

  • Wednesday, October 18th, 1:00pm in DNAnexus Booth #811

Edico Genome’s DRAGEN on DNAnexus
See a demo of DRAGEN, Edico Genome’s ultra-rapid, accurate, and cost-efficient genomic data analysis pipeline on DNAnexus. Sign up here or in our booth to take advantage of limited-time promotional pricing on DNAnexus.

  • Wednesday, October 18th, 10:40am in Edico Genome Booth #710
  • Wednesday, October 18th, 2:30pm in DNAnexus Booth #811

PacBio SMRT Analysis Suite 5.0 Available Now on DNAnexus
Test drive PacBio’s SMRT Analysis software on DNAnexus. The suite of SMRT tools includes a comprehensive set of applications for genomic analysis, including de novo assembly, variant calling, transcriptome analysis, epigenomics, and more.

  • Thursday, October 19th, 1:00pm, DNAnexus Booth #811
  • Friday, October 20th, 11:00am, PacBio Booth #722

Join the Microbiome Research Community  
Stop by to learn how to get involved in a series of community challenges aimed at increasing the understanding of the human microbiome and its relation to disease.

  • Thursday, October 19th, 2:30pm, DNAnexus Booth #811

Posters Featuring DNAnexus  

PgmNr 745: Access, visualize and analyse pediatric genomic data on St Jude Cloud.

  • Speaker: Scott Newman, St Jude Children’s Research Hospital
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 2563: Improved molecular tracking of individual genomes for clinical whole-genome sequencing.

  • Speaker: Sergey Batalov, Senior Bioinformaticist, Rady Children’s Institute for Genomic Medicine
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 1951: Exome-wide association study of kidney function in 55,041 participants of the DiscovEHR cohort.

  • Speaker: Claudia Schurmann, Statistical Geneticist, Regeneron Genetic Center, Regeneron Pharmaceuticals
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 763: Whole genome sequencing signatures for early detection of cancer via liquid biopsy.

  • Speaker: Bahram Kermani, Founder & CEO, Crystal Genetics
  • Time: Wednesday, October 18th, 2:00pm-3:00pm

PgmNr 1281: Cloud-based quality measurement of whole-genome cohorts.

  • Speaker: Will Salerno, Human Genome Sequencing Center, Baylor College of Medicine
  • Time: Friday, October 20th, 11:30am-12:30pm

Removing the NGS Analytics Data Bottleneck with Field-Programmable Gate Arrays (FPGAs)

Edico Genome’s FPGA-backed DRAGEN Bio-IT Platform Now Available on DNAnexus

The following is a guest blog, written by our partners at Edico Genome.

With rapid adoption across a variety of fields, next-generation sequencing (NGS) is on track to become one of the largest producers of big data by 2025. While NGS promises exceptional breakthroughs wherever it is applied, one major problem threatens its expansion: a lack of computing power to analyze the rapidly growing body of data.

Current projections have genomic data continuing to double every seven months, far outpacing Moore’s Law, under which CPU capabilities double roughly every two years. The widening gap between the two creates a bottleneck for genomics labs.
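
To make the size of that gap concrete, the short Python sketch below compounds the two doubling periods quoted above. It assumes, purely for illustration, that both rates hold steady over the whole window; the function names are hypothetical and not part of any DRAGEN or DNAnexus tooling.

```python
DATA_DOUBLING_MONTHS = 7   # genomic data projection cited in the text
CPU_DOUBLING_MONTHS = 24   # Moore's Law figure cited in the text


def growth_factor(months, doubling_period_months):
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period_months)


def gap_factor(months):
    """How many times faster data has grown than CPU capability after `months`."""
    return (growth_factor(months, DATA_DOUBLING_MONTHS)
            / growth_factor(months, CPU_DOUBLING_MONTHS))


# Print the compounding effect over a few illustrative time windows.
for years in (2, 5, 8):
    months = 12 * years
    print(f"{years} years: "
          f"data x{growth_factor(months, DATA_DOUBLING_MONTHS):,.0f}, "
          f"CPU x{growth_factor(months, CPU_DOUBLING_MONTHS):,.1f}, "
          f"gap x{gap_factor(months):,.0f}")
```

Over two years, for example, data grows roughly elevenfold under these assumptions while CPU capability only doubles, so the gap is already more than fivefold.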

Designed to uncork this big data bottleneck, Edico Genome’s DRAGEN™ (Dynamic Read Analysis for Genomics) Platform leverages FPGA (Field-Programmable Gate Array) technology to provide customers with hardware-accelerated implementation of genome pipeline algorithms. Leveraging FPGAs, DRAGEN allows customers to analyze NGS data at unprecedented speeds with extremely high accuracy and unwavering dependability.

Uncorking the big data bottleneck with DRAGEN

In contrast to conventional CPU-based systems, which must execute lines of software code to perform an algorithmic function, FPGAs implement algorithms as logic circuits, providing an output almost instantaneously. By replicating these logic circuits thousands of times over, DRAGEN is able to achieve industry-leading speeds by allowing for massive parallelism, unlike CPUs, which are limited to running only one task per core. FPGAs are also fully reconfigurable, enabling customers to switch between functions and pipelines within seconds.

As a result, DRAGEN delivers high accuracy while functioning with industry-leading speed, efficiency, and parallelism. DRAGEN can process an entire human genome at 30x coverage in about 90 minutes, as compared to over 30 hours using a traditional CPU-based system, saving customers time and money. DRAGEN’s Genome Pipeline is now available on DNAnexus at a reduced trial rate until October 31, 2017. To sign up for exclusive promotional pricing, visit https://www.dnanexus.com/edico-trial