The First Publicly Available “$1000 Genome” Test Dataset!

At DNAnexus we’re always looking for ways to collaborate on projects that are outside the norm, and this latest collaboration is no exception. We’ve teamed up with the Garvan Institute and AllSeq to offer the genomics community open access to the first publicly available datasets generated using Illumina’s HiSeq X Ten sequencing system.



Why are we doing this?

Our goal is to provide sample data that will give scientists a glimpse into what to expect from the technological advances of the HiSeq X Ten. Has the new sequencing technology lived up to Illumina’s promise?


Here’s what went down

AllSeq arranged this data-sharing endeavor as a part of its Sequencing Marketplace effort, which aims to educate scientists about different sequencing technologies and match them with providers that offer these technologies.

The Garvan Institute, located in Sydney, Australia, was one of the first three organizations in the world to acquire the Illumina HiSeq X Ten sequencer. To educate the genomics community about the potential of this exciting new technology, the institute has made available two whole-genome sequencing datasets generated from the popular Coriell Cell Repository NA12878 reference sample, which has been extensively analyzed by the Genome in a Bottle Consortium.

Thanks to the Garvan, visitors have access to two different, high-quality datasets (NA12878D and NA12878J). Each was sequenced on a single lane of an Illumina HiSeq X patterned flow cell, achieving over 120 Gb of yield, with more than 87% of bases above Q30, in just 2.8 days. Each dataset meets the minimum coverage and quality guaranteed by Illumina and is indicative of the potential of the Illumina HiSeq X Ten sequencing system.
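To put the yield figure in context, here is a rough back-of-the-envelope sketch of how lane yield translates into raw mean coverage. It is only an illustration: the genome size of about 3.1 Gb is our assumption (it is not a number from the Garvan data), the function names are made up for this example, and the result ignores duplicates, unmapped reads, and filtering, so the metrics in the shared files remain the real reference.

```python
# Back-of-the-envelope coverage estimate from sequencing yield.
# ASSUMPTION: a haploid human genome of roughly 3.1 Gb. This value is
# illustrative and does not come from the published datasets.
ASSUMED_GENOME_BASES = 3.1e9


def mean_coverage(yield_bases: float,
                  genome_bases: float = ASSUMED_GENOME_BASES) -> float:
    """Raw mean depth: total sequenced bases divided by genome size."""
    if genome_bases <= 0:
        raise ValueError("genome_bases must be positive")
    return yield_bases / genome_bases


def high_quality_bases(yield_bases: float, q30_fraction: float) -> float:
    """Number of bases above Q30, given the fraction that passes."""
    if not 0.0 <= q30_fraction <= 1.0:
        raise ValueError("q30_fraction must be between 0 and 1")
    return yield_bases * q30_fraction


if __name__ == "__main__":
    lane_yield = 120e9  # "over 120 Gb" per lane, so this is a lower bound
    q30 = 0.87          # ">87%" of bases above Q30, also a lower bound

    print(f"raw mean coverage: {mean_coverage(lane_yield):.1f}x")
    print(f"bases above Q30:   {high_quality_bases(lane_yield, q30) / 1e9:.1f} Gb")
```

Under these assumptions, a single lane works out to a raw depth in the high 30s, before any bases are lost to duplicates or filtering.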

DNAnexus stepped in to sponsor the data storage and the bandwidth for downloading the data. In addition, DNAnexus ran analyses on the two genomes to produce metrics that give the scientific community a benchmark against which to gauge results from the “$1000 genome.”

Visitors can view and download the data without a DNAnexus account via the AllSeq webpage, which links to the original FASTQ files, the analysis results (BAM and VCF files), and quality metrics calculated using off-the-shelf tools such as FastQC and Picard (MarkDuplicates, CollectInsertSizeMetrics, and CollectWgsMetrics). We’ve also provided a web-based genome browser to visualize one dataset (NA12878D). You can access and download all of this data until September 30, 2014.

Those with DNAnexus accounts can also access the data via the “HiSeq X Ten Data” featured project, located on the left-hand side of the dashboard. Users can copy any of the files to their own DNAnexus projects for further downstream analysis.

We’d love to hear from you! Tell us what you think about the HiSeq X Ten data.


The Hundred Year Study? Newborn Sequencing Grants Bring Opportunity for Long-Term Data Analysis

The announcement last month that the National Human Genome Research Institute and National Institute of Child Health and Human Development were awarding $25 million to study genome sequencing for newborns was welcome news to the genomics community, and it will serve as a great opportunity to understand the long-term implications of analysis and storage of DNA data.

After all, in addition to the clinical and ethical implications of conducting genome sequencing from birth, there are a host of logistical questions, including how that data will be managed for the 90 or 100 years that many of these newborns will live.

Projects like these offer an incredible opportunity to think about lifelong, clinically useful genomic data. We anticipate the need for storage and infrastructure that is far more dynamic than what we’re used to today, with our flash drives, hard drives, and DVDs. Consider data gathered just 30 years ago: very few of the media on which it was stored are even accessible with current technology. Anyone remember Zip drives and floppy disks? Continuing innovations in media not only render older storage technologies obsolete, but all too often leave them completely incompatible with their successors.

For these new projects that will sequence thousands of individuals from newborns to adults, it’s simply not realistic to expect a team of scientists to manually shift data every few years across several different types of media, to keep these important genome sequences easily accessible. That’s why we think cloud computing will be the best answer for programs like this one. Cloud providers already make it their business to use the latest and greatest technology, and they have entire teams of experts who spend their days making sure data will remain live and readily accessible in the long term.

Cloud computing services can also ensure ongoing clinical compliance and rock-solid security, two critical needs for datasets like these newborn genomes. And they can scale up easily and cost-effectively as demand for newborn genome sequencing soars in the coming years, providing non-redundant, secure, readily accessible resources.

We look forward to seeing the results of these valuable new studies, and to participating in the discussion as the community thinks about best practices for interpreting and managing data that may need to be maintained for a century or more.