Introducing htsget, a new GA4GH protocol for genomic data delivery

DNAnexus is here in Orlando for the fifth plenary meeting of the Global Alliance for Genomics and Health (GA4GH), the standards-making body advancing interoperability and data sharing for genomic medicine. We’re especially pleased this year to join in launching version 1.0 of htsget, a new protocol for the secure web delivery of large genomic datasets, particularly whole-genome sequencing reads, which can exceed 100 gigabytes per person.

Htsget complements the incumbent BAM and CRAM file formats for reads, which GA4GH also stewards, and their ecosystem of tools. It adds a standardized protocol for accessing such data over the web securely, reliably, efficiently, and even in a federated fashion when needed. Retrieval with htsget is now built into the ubiquitous samtools via its underlying htslib library, allowing bioinformaticians to use htsget with most existing tools via a familiar Unix pipe. At the same time, htsget’s streaming parallelism enables scalable ETL into cluster environments like Apache Spark, providing a gradual transition path from incumbent file-based toolchains toward modern “big data” platforms. Lastly, htsget simplifies data access for interactive genome browsers by unifying authentication and removing the need for index files.
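To make the protocol a little more concrete, here is a minimal Python sketch of the client side as we understand the htsget 1.0 flow: the client asks the server for a genomic region, receives a JSON "ticket" listing URLs (some of which may be inline data: URIs), then fetches each block in order and concatenates the bytes into a valid BAM or CRAM stream. The function names, the example.org base URL, and the sample ID are hypothetical and for illustration only; consult the specification for the authoritative field names and error handling.

```python
import base64
import urllib.request
from urllib.parse import urlencode, unquote_to_bytes


def build_ticket_url(base_url, read_id, reference_name=None,
                     start=None, end=None, fmt="BAM"):
    """Build the URL for a ticket request (hypothetical helper).

    Assumes the server exposes reads under <base_url>/reads/<id>.
    """
    params = {"format": fmt}
    if reference_name is not None:
        params["referenceName"] = reference_name
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{base_url.rstrip('/')}/reads/{read_id}?{urlencode(params)}"


def fetch_block(block):
    """Return the bytes for one entry of the ticket's "urls" list."""
    url = block["url"]
    if url.startswith("data:"):
        # Inline block: decode it locally, no network request needed.
        header, _, payload = url.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        return unquote_to_bytes(payload)
    # Remote block: forward any headers the server asked us to send
    # (for example an authorization token or a byte range).
    request = urllib.request.Request(url, headers=block.get("headers", {}))
    with urllib.request.urlopen(request) as response:
        return response.read()


def assemble(ticket):
    """Concatenate all blocks, in ticket order, into one byte string."""
    return b"".join(fetch_block(block) for block in ticket["htsget"]["urls"])
```

Because each block in the ticket is an independent request, a client could also fetch the blocks concurrently and reassemble them in ticket order; that is one way to read the "streaming parallelism" mentioned above. The sketch fetches them sequentially to keep it short.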

On the server side, htsget has been deployed at the Sanger Institute and the European Genome-phenome Archive; DNAnexus operates a multi-cloud htsget server, which we call htsnexus, indexing data within Amazon S3 and Azure Blob storage; and Google Cloud Platform has open-sourced its own implementation. Clients can speak a uniform protocol that abstracts the diverse authentication and storage schemes of these service providers.

These groups, and others, have all shaped the htsget specification through the GA4GH’s highly collaborative process. But it started in large part with a contribution from DNAnexus, drawing on our experience optimizing how our systems utilize cloud object stores in the huge genome projects we’ve served, such as CHARGE, 1000 Genomes Project, TCGA, and HiSeq X Series data production. Through htsget and other work streams under the new GA4GH Connect framework announced today, DNAnexus looks forward to further contributing from our experience and network to advance the GA4GH’s essential mission.

For more information about how DNAnexus is working with htsget, please contact us at info@dnanexus.com.

Supporting Freebayes, to Serve Our Customers and the Community

Freebayes is a variant calling tool for short-read sequencing by Erik Garrison, Gabor Marth, and others, which played a significant role in the 1000 Genomes Project. It’s widely appreciated for its quality results, cost-effective performance, and permissive open-source license. At DNAnexus, many of our customers have come to rely on it in their sequencing pipelines. But, like many software tools in genome informatics, its development might have stopped at the conclusion of its (hugely successful) sponsor project.

We listened to our customers, and heard clearly that freebayes is too valuable to let that happen. A few months ago, we began working with Erik on a roadmap for ongoing development and maintenance with our support. Through our collaboration, Erik recently delivered a capability to generate gVCF output files, a significant feature both for individual genome interpretation and for aggregate analysis of vast cohorts. We’re continuing to refine that feature, and we have many more queued up to ensure freebayes remains a tool of choice for both research and clinical sequencing pipelines.

Importantly, freebayes and our collective contributions to it will remain free for all to use and build upon, under the MIT license. Furthermore, we will make our best effort to assist all its users through public forums. We’d love to hear about your use cases and ideas to further improve freebayes – reach us on GitHub or Gitter, or send us a tweet. Erik remains in his day job, realizing a totally new paradigm in genome informatics, and we’re delighted he can also work with us to make freebayes endure as a tool the community can count on. So to all the genome hackers out there: please hack on freebayes too! You can read more on ‘How to Freebayes’ on Erik’s blog.

Because no single tool can possibly serve all applications, DNAnexus continues to work with numerous collaborators toward advancing methods in genome informatics, both free and commercially licensed. We also continue to wholeheartedly support our customers’ choice of methodologies to deploy on our platform, whether sourced from our partner network or elsewhere. We’re delighted by this opportunity to both deliver value to our customers and give back to the broader community. To the genome hackers again: we’re on the lookout for more of these opportunities! (We’re hiring, too!)