On Being Platform Agnostic

One inevitable outcome of the ever-expanding number of DNA sequencing platforms is the lock-step addition of new data types. The technologies developed by Complete Genomics, Illumina, Life Tech/ABI/Ion Torrent and Pacific Biosciences produce the lion’s share of genomic data today. But Genia, GnuBio, NABsys, Oxford Nanopore and others are waiting in the wings, poised to add significantly more.

Every sequencing platform relies on a different technology to read the As, Ts, Cs, and Gs of a genome. This presents major challenges for assembly and sequence accuracy across platforms, owing to varying read lengths, method-specific data generation, sequencing errors, and so forth. Yet while each platform has its nuances, all of them hold potential value for the progress of life science and medical research.

A complete solution to this problem would involve models for each platform, accounting for the generation and characteristics of libraries, data collection, transcript distributions, read lengths, error rates, and so on. The fact that a standard solution for integrating all these data types doesn’t currently exist is a testament to the difficulty of this task, which shouldn’t be underestimated.

The solutions most commonly used today for managing this diversity of data are the products of enterprising bioinformaticians who have developed “home-brewed” applications capable of taking the primary data created by an instrument and, among other tricks, performing alignments to a reference genome and/or completing assemblies. While these workarounds provide a band-aid, they are not available for all platforms, are rarely scalable, and require highly experienced technical users to manage.
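To make that concrete, here is a minimal, hypothetical sketch of what such a home-brewed wrapper often looks like in practice: a short Python script that shells out to common open-source tools to align one platform's reads against a reference genome. The tools (bwa and samtools 1.x), file names, and parameters are assumptions for illustration, not a description of any particular lab's pipeline.

```python
# Hypothetical sketch of a "home-brewed" per-platform pipeline: take reads off
# an instrument, align them to a reference with bwa, and produce a sorted BAM.
# Assumes bwa and samtools (1.x) are on PATH, and that the reference has
# already been indexed with "bwa index reference.fa". File names are placeholders.
import subprocess
from pathlib import Path


def align_reads(reference: Path, reads: Path, out_prefix: str) -> Path:
    """Align a FASTQ file to a reference and return the path to a sorted BAM."""
    sam_path = Path(f"{out_prefix}.sam")
    bam_path = Path(f"{out_prefix}.sorted.bam")

    # Step 1: alignment. In practice, each platform's read length and error
    # profile would demand its own aligner and settings, which is part of why
    # these one-off scripts are hard to maintain.
    with sam_path.open("w") as sam_file:
        subprocess.run(
            ["bwa", "mem", str(reference), str(reads)],
            stdout=sam_file,
            check=True,
        )

    # Step 2: coordinate-sort and convert to BAM for downstream tools.
    subprocess.run(
        ["samtools", "sort", "-o", str(bam_path), str(sam_path)],
        check=True,
    )
    return bam_path


if __name__ == "__main__":
    align_reads(Path("reference.fa"), Path("instrument_reads.fastq"), "sample01")
```

Multiply a script like this by every platform, chemistry update, and lab-specific convention, and it becomes clear why such pipelines rarely scale and why they demand experienced hands to keep running.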

As genomic data continues its march beyond core facilities and into a broader range of research labs, healthcare organizations and, eventually, point-of-care providers, the need becomes even more acute for technologies that can, as far as the user is concerned, effortlessly integrate data from multiple sources for annotation and interpretation, and combine that data with the analysis and collaboration tools needed to glean insights.

As an industry, we need to start taking a more platform-agnostic approach toward the analysis and visualization of sequencing data. This is particularly critical as new platforms enter the market, as collaborations across institutions, labs and borders expand, and as “legacy” data is incorporated into new repositories.

At DNAnexus, we are committed to removing the complexities inherent in working with diverse datasets so that scientists and clinicians can focus on the more impactful areas of data analysis and knowledge extraction. We are also committed to providing a secure and user-friendly online workspace where collaboration and data sharing can flourish.

Stay tuned for much more on this topic, and let us know about the challenges you face when working with multiple data types and what kinds of datasets you’d like to see more easily integrated into your work.

AGBT in Review: Highlights and High Hopes for Data

Last week’s Advances in Genome Biology and Technology (AGBT) meeting was every bit the fast-paced roller coaster ride we were anticipating. As expected, there were no major leaps announced by the established vendors, although Illumina, Life Tech’s Ion Torrent, and Pacific Biosciences all had a big presence at the conference.

View from my hotel room: I got lucky with an oceanfront room.

The biggest splash by far came from Oxford Nanopore Technologies, which emerged from stealth mode with a talk from Chief Technology Officer Clive Brown. The company’s technology sequences DNA by detecting changes in electrical current as the strand moves through a nanopore. Brown said the technology had been used successfully to sequence the phi X genome (a single 10 kb read captured both the sense and antisense strands) and the lambda genome (a 48 kb genome, also covered in a single pass). Brown reported a raw read error rate of 4 percent, mostly caused by the DNA strand oscillating in the nanopore instead of moving smoothly through it. Other significant features: the nanopore can read RNA directly, detect methylation status, and be used directly on a sample (such as blood) with no prep required.

What I thought was most interesting, though, was that at a meeting known for being wall-to-wall sequencing technology, this year’s event really focused more on two arenas: clinical genomics and data analysis. The conference kicked off with a session on clinical translation of genomics, with speakers including Lynn Jorde from the University of Utah and Heidi Rehm from Harvard. Both talked about the key challenges in data analysis and interpretation, with Rehm in particular stressing the need for a broadly accessible data platform with clinical-grade information that could be ranked by confidence level and would pull data together from a variety of disparate sources. Notably, the clinical talks were generally limited by small sample sizes and sometimes wound up with results that were inconclusive in recommending a particular course of treatment. That’s to be expected in the early stages of moving sequence data into a clinical environment, of course, but it also underscores the opportunities here once low-cost sequencing becomes widely available.

The trend was clear: data, data, data. And the only way to make the most of all that data will be to pave the way to an environment where information can be accessed and shared easily, with as many tools as possible to interrogate, analyze, and validate it.

Sequence Data: The View from JP Morgan

Last week, Andrew Lee, Vice President of Strategic Operations, and I attended the JP Morgan Healthcare Conference, an annual investor conference here in San Francisco that brings some 25,000 people to the city. The big news this year came from Life Technologies and Illumina, which both announced platforms that will be capable of sequencing an entire human genome in a day. Life Tech in particular noted that its same-day genome sequence will cost $1,000 in reagents — effectively putting an end to a race that began 10 years ago, when scientists first started seriously competing to achieve the $1,000 genome.

With this price point achieved, we expect people to sequence genomes at a much faster clip than ever before. Indeed, a survey from GenomeWeb and Mizuho Securities found that scientists anticipate that sequencing data will increase 32 percent this year over 2011, and increase another 38 percent next year compared to this year. That’s exciting on the data analysis front: As the volume of DNA data grows exponentially, it’s even more important to have a scalable platform to manage, store, and analyze that data securely and efficiently. The GenomeWeb/Mizuho survey also found that people expect to spend more on informatics in 2012 than they did in 2011.

After all, mainstream genome sequencing won’t be possible until analysis costs come down by orders of magnitude. For all the attention on cutting the price tag of sequencing itself, that focus hasn’t yet translated into matching reductions in the cost of managing and analyzing the data. The race to the $1,000 genome may be over, but as it turns out, that was just the first leg of a relay. Now the baton has been passed to the data management and analysis folks, and it’s our turn to run as fast as we can.