Pacific Biosciences, a key partner of DNAnexus, has released the first public Sequel™ dataset of NA12878. The dataset provides 10-fold coverage: 32.8 Gb of data with an N50 read length of 11.8 kb. Generated on PacBio’s new Sequel System, it was used to demonstrate that even low-coverage long-read data can robustly discover novel structural variants. The Sequel System is smaller and faster and provides higher throughput, delivering around 7 times as much data as the PacBio RS II.
Beyond structural variant discovery, PacBio long reads have been used for robust, high-quality de novo genome and transcriptome assemblies. Additionally, with the instrument and chemistry improvements made for the Sequel System, the cost of generating a 50-fold coverage human dataset for resequencing and de novo assembly is expected to decrease dramatically.
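For readers who want to sanity-check figures like "10-fold coverage" and "N50 read length," here is a minimal Python sketch of the arithmetic. It assumes a human genome size of roughly 3.2 Gb, a round number chosen for illustration rather than a value stated by PacBio. The function names are hypothetical and are not part of any PacBio or DNAnexus tool.

```python
# Assumption: approximate haploid human genome size in base pairs.
HUMAN_GENOME_BP = 3_200_000_000


def fold_coverage(total_bases, genome_size=HUMAN_GENOME_BP):
    """Average fold coverage = sequenced bases / genome size."""
    return total_bases / genome_size


def bases_for_coverage(fold, genome_size=HUMAN_GENOME_BP):
    """Sequenced bases needed to reach a target fold coverage."""
    return fold * genome_size


def n50(read_lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("read_lengths must not be empty")


if __name__ == "__main__":
    # 32.8 Gb of Sequel data over an assumed 3.2 Gb genome.
    print(fold_coverage(32.8e9))
    # Bases needed for a 50-fold human dataset under the same assumption.
    print(bases_for_coverage(50))
    # Toy read lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

Under this assumed genome size, 32.8 Gb works out to about 10.25-fold coverage, consistent with the "10-fold" figure, and a 50-fold dataset would need about 160 Gb.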
Although sequencing costs are dropping, de novo genome assembly and structural variant calling remain complex tasks that can require massive computational resources, whether to weave long reads into a final, polished assembly or to run structural variation detection methods across multiple data types. For this reason, PacBio has selected DNAnexus as its cloud bioinformatics partner, providing bioinformatics support to its global customers. The SMRT® Analysis Suite v3.1.1 is available on the DNAnexus Platform and has been optimized for the cloud environment. Other long-read analysis tools, such as PBHoney, PBJelly, and Parliament, are available as well.
Curious about PacBio tools and services on DNAnexus? Schedule a 30-minute scientific consultation.
De novo assemblies of individual human genomes sequenced on the PacBio RS II at high-fold coverage have revealed tens of thousands of structural variants (SVs), many of which are accessible only through SMRT Sequencing. To optimize SV discovery methods, PacBio set out to understand which SVs could be identified in the well-studied human sample NA12878 from low-fold coverage sequencing on the new Sequel System. To create the NA12878 Sequel dataset, PacBio generated approximately 10-fold coverage of the sample on the Sequel System, at a sequencing cost of about $5,000. The newly generated long reads were mapped to the GRCh37 human reference using NGM-LR, and structural variants were called with PBHoney.
The output calls were compared to a “truth set” generated by merging the 1000 Genomes Project and Genome in a Bottle NA12878 call sets, both of which were produced using short-read technology at much higher coverage. The low-coverage, 10-fold Sequel System set recalled 86% of truth set deletions and 81% of truth set insertions. In addition, the 10-fold Sequel call set identified thousands of insertions and deletions not found in the short-read truth set, and over 66% of these novel structural variants were verified using a FALCON-Unzip de novo assembly of 60-fold PacBio RS II data.
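To make the recall figures concrete, the following Python sketch shows one way recall against a truth set, and the count of novel calls, could be computed. It is an illustration only: the matching rule (same chromosome and SV type, breakpoints within 500 bp, sizes within a factor of two) and the toy variants are assumptions made for this example, not the criteria or data PacBio used.

```python
from collections import namedtuple

# Simplified SV record; real call sets carry far more detail.
SV = namedtuple("SV", ["chrom", "pos", "svtype", "length"])


def matches(call, truth, max_dist=500, min_size_ratio=0.5):
    """Assumed matching rule: same chromosome and type, nearby, similar size."""
    if call.chrom != truth.chrom or call.svtype != truth.svtype:
        return False
    if abs(call.pos - truth.pos) > max_dist:
        return False
    smaller, larger = sorted((call.length, truth.length))
    return larger > 0 and smaller / larger >= min_size_ratio


def recall(calls, truth_set):
    """Fraction of truth-set SVs matched by at least one call."""
    if not truth_set:
        raise ValueError("truth_set must not be empty")
    found = sum(1 for t in truth_set if any(matches(c, t) for c in calls))
    return found / len(truth_set)


def novel_calls(calls, truth_set):
    """Calls that match no truth-set SV."""
    return [c for c in calls if not any(matches(c, t) for t in truth_set)]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    truth = [
        SV("1", 1000, "DEL", 300),
        SV("1", 50000, "DEL", 1200),
        SV("2", 7000, "INS", 250),
    ]
    calls = [
        SV("1", 1040, "DEL", 280),
        SV("2", 7010, "INS", 240),
        SV("3", 900, "INS", 400),
    ]
    print(recall(calls, truth))
    print(novel_calls(calls, truth))
```

In this toy example, two of the three truth-set variants are recovered and one call is novel. In practice, recall would be computed separately for deletions and insertions, as in the figures reported above.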
This dataset demonstrates that low-coverage Sequel reads can be used for accurate variant calling as well as novel structural variant identification, both of which are now available at a fraction of the cost with the new Sequel System. Through the partnership with DNAnexus, you can recreate the analysis performed on the NA12878 10-fold set. The tools and the generated dataset can be found on DNAnexus under Featured Projects.
Questions? Contact us directly at: pacbio@dnanexus.com.