Investigating Differences Between Simulated and Real FASTQ DNA-Seq Data


In this blog, we explore some of the ways that simulated FASTQ data differs from real reads generated by a sequencer from an actual human sample. We conducted this analysis through the lens of developing a new strategy for discovering copy number variants. The differences we encountered illustrate some of the challenges of using simulated data, and we used them to analyze the quality of the reference genomes themselves and to compare GRCh37 with GRCh38.

Introduction

Both the development and assessment of any new method start from well-understood initial datasets. When available, well-characterized exemplar datasets (for example, the Genome in a Bottle genomes for SNP and Indel calling) are powerful resources. However, in some cases, creating simulated datasets with understood and controlled properties is the only way to develop a method.

In genomics, many simulation methods have been written specifically for this purpose, including Mason, WGSIM, and DWGSIM for FASTQ generation and a myriad of others for small variants, structural variants, gene expression, and metagenomes. At the same time, it is very challenging to create a simulation which fully captures all of the biases in real systems. A FASTQ simulator must model the biases in GC sequence context; the base, insertion, and deletion error profile; and the insert size distribution. Further, it must capture how each term interacts with the others.

In working with simulated data to form a baseline, we encountered an additional issue in FASTQ simulation, one which arises not because the simulators imperfectly model sequencing, but because the reference genome itself is still imperfect. This leads to regions where the sequence content is significantly over- or under-represented relative to reads of a real human sample from a real sequencer.

How Incompleteness of the Reference Might Impact Simulations

Most genomics analyses are conducted relative to reference genomes. Having a reference makes analysis much easier computationally. It also serves as a shared yardstick against which other samples can be measured, creating a common vocabulary for gene and variant positions and content. Reference genomes are hard to assemble because many parts of the genome look similar to each other (such as centromeres, telomeres, mobile elements, and segmental duplications).

The human reference genome is continually evolving, with the most recent version being GRCh38. However, this last version was produced in 2013, before powerful long-range technologies like PacBio, 10X, Oxford Nanopore, Dovetail, Phase, BioNano, and Nabsys hit full stride; these technologies could complement the strong but labor-intensive BAC-by-BAC approach.

[Figure: simplified schematic of a reference chromosome with incompletely assembled regions and simulated reads]

The figure above contains a highly simplified schematic of a chromosome. The purple, red, and orange regions represent types of areas where the reference sequence is likely incomplete. The green simulated reads show that the simulation only partially samples the sequence near these regions.

Sashimi – a Method for Copy Number Variation Leveraging Salmon

To frame our work, we first give a high-level overview of Sashimi, a new method we have been developing to discover copy number variants (CNVs). In a future blog post, we will delve into methods and benchmarks in greater detail.

Sashimi is built on top of Salmon, a tool for RNA-Seq quantification. The central idea behind Sashimi is that detecting the number of copies of a segment of DNA is conceptually similar to detecting the number of copies of an RNA transcript. Salmon models and corrects for a number of complex factors that make structural variant calling hard (biased GC sampling and smart weighting of ambiguous support), is rapid, and learns these weights directly from the data.

Salmon quantifies the number of copies of a sequence represented in the data as a normalized value, Transcripts per Million (TPM), which in our case is instead an estimate of DNA copy number. In Sashimi, we index the genome as 500bp bins, run Salmon to determine the TPM for each bin, and then smooth and normalize this signal to identify CNVs.
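
As a rough illustration of the smoothing and thresholding steps (this is not Sashimi's actual implementation; the window size and thresholds below are placeholder assumptions), a median-filter-and-threshold pass over the per-bin TPM values for one chromosome might look like:

import numpy as np
from scipy.ndimage import median_filter

def call_cnv_segments(tpm, bin_size=500, window=21, del_frac=0.5, dup_frac=1.5):
    # Toy CNV caller over per-bin Salmon TPM values for one chromosome.
    # tpm: 1-D array of TPM estimates, one per consecutive 500bp bin.
    # Returns a list of (start_bp, end_bp, 'DEL'|'DUP') candidate segments.
    smoothed = median_filter(np.asarray(tpm, dtype=float), size=window)
    baseline = np.median(smoothed[smoothed > 0])          # ignore gap (N) bins
    state = np.where(smoothed < del_frac * baseline, -1,
                     np.where(smoothed > dup_frac * baseline, 1, 0))

    calls, start, kind = [], None, 0
    for i, s in enumerate(np.append(state, 0)):           # trailing 0 flushes the last run
        if start is None and s != 0:
            start, kind = i, s
        elif start is not None and s != kind:
            calls.append((start * bin_size, i * bin_size, "DEL" if kind < 0 else "DUP"))
            start = None if s == 0 else i
            kind = s
    return calls

In practice, Sashimi's normalization and event calling are more involved; the point here is simply that the TPM track can be treated as a depth-like signal.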

We simulated deletion events in the reference with SURVIVOR and simulated reads with Mason. To compare with real data, we used a set of candidate truth structural variants (v0.6) generated by Genome in a Bottle for HG002 and a 35X HiSeqX sample from the precisionFDA Truth Challenge.

Differences in TPM Calculated by Salmon from Simulated and Real FASTQ Data

[Figure: TPM values across chromosome 1 for real data (left) and simulated data (right)]

The left figure above shows the TPM value from real data across chromosome 1, while the right shows the same values for the simulated data. Note the difference in the y-axes between the two: the real data has a maximum value more than 1000 times that of the simulated data. The regions with 0 TPM are scaffold sequence (NNNNN…).

Why are these plots so different? The region 5’ of the centromere contains a number of sequence repeats, so there will be many reads carrying this sequence. Salmon has a clever approach to managing reads that could support multiple transcripts: it assigns them fractionally across the candidate sequences, weighted by the probability that each one generated the read.
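
As a toy illustration of that weighting (a deliberate simplification of Salmon's actual model, which refines abundance estimates iteratively and corrects for several biases), a single assignment pass might look like:

def fractional_assignment(read_compat, abundances):
    # Assign each multi-mapping read fractionally across the sequences it is
    # compatible with, weighted by the current abundance estimate of each.
    counts = {seq: 0.0 for seq in abundances}
    for candidates in read_compat:
        total = sum(abundances[c] for c in candidates)
        for c in candidates:
            counts[c] += abundances[c] / total if total > 0 else 1.0 / len(candidates)
    return counts

# A read compatible with two equally abundant bins contributes 0.5 to each;
# a uniquely mapping read contributes its full weight.
print(fractional_assignment([["binA", "binB"], ["binA"]], {"binA": 1.0, "binB": 1.0}))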

Since the reference is incomplete, a real sample contains proportionally far more of these reads than there are reference regions for Salmon to divide them among. Thus, in a real sample, coverage piles up in the regions whose sequences resemble the missing components. Because Mason can only treat the reference as correct, it simulates data in proportion to the incomplete reference.

Capping the TPM axis at 10 allows us to see a host of other regions with a similar effect:

[Figure: TPM values across the chromosome with the axis capped at 10 TPM]

The figures below show the density of TPM values calculated for real and simulated data across the genome, binned by TPM. The peaks at the extreme right of the TPM range show the same effect.

[Figure: kernel density estimates of TPM values for real and simulated data]

In Regions Well Represented by the Reference, Differences Between Simulated and Real Data are Detectable

We can constrain our analyses to regions outside of these extreme peaks. When we do so, it is still possible to detect differences between real and simulated data when viewed through the lens of Salmon TPM.

[Figure: TPM values for real and simulated data in a well-represented region]

The TPM values are much more consistent in this region (most genome regions look like this), and both have a mean of around 0.2 TPM. However, there is more variability in the real data. This likely reflects some sampling bias (e.g. GC or sequence context) that is not captured in the simulation.

Measures of Precision and Recall for a Method Can Be Very Different in Simulated Data

Unfortunately, this sort of variability, present in real data but not simulated data, is exactly the sort of thing that makes CNV calling much more difficult. We tried a variety of smoothing and machine learning approaches to call CNV events. The performance we could achieve on simulated data was very different from the performance we could achieve on the Genome in a Bottle set.

[Figure: F1 scores for CNV calling on simulated and real data]

If we took the accuracy of Sashimi only on simulated data and compared it to our benchmarks of other CNV methods on Genome in a Bottle, we would conclude Sashimi is best in class. If we constrained the simulation to only chromosome 21, we would conclude the problem to be virtually solved.

Going from real data to simulated data, and then from simulated data to only chromosome 21, simplifies the problem we are trying to solve in artificial ways, because the simulation doesn't reflect key properties relevant to CNV calling. The chromosome 21 sample is easier still because it removes the complexity of interchromosomal segmental duplications.

Why This Affects Processes Like CNV Calling or SNP and Indel Calling

In a real sample, the incompleteness of the reference will cause coverage to pile up in regions with sequences that resemble the missing components. These mismapped reads may have sequence divergence which falsely looks like SNP and Indel signal to variant callers. Common signals for structural variant and CNV callers (reads mapped in proper pairs, average depth, insert size) will also be increased in these sorts of regions. The GC variability observed will also cause fluctuation in coverage. When using simulated data, these features will not be present.

Because of the way structural variant and CNV callers work, they will likely be more affected than SNP and Indel callers. For SNP and Indel calling, some of this may explain observations by Heng Li on difficulties calling in high coverage regions, and the recommendation to filter low-complexity regions.

The Difference Between Simulation and Reality Allows Us to Assess Improvements in the Reference Genome

All of the analyses conducted so far have been on the GRCh37 reference genome. Instead of measuring samples by comparing to the reference, we can measure the reference by comparing to the samples.

The figure below shows the same graph for real HG002 data when applying Salmon with GRCh37 (the hs37d5 build) or GRCh38 (with decoys but without ALT contigs). Note the scale difference between the two figures. Many of the gaps have been narrowed.

[Figure: TPM values for HG002 with GRCh37 and GRCh38]

Here is the same comparison, but with a cap of 10 TPM, showing improvement in GRCh38 beyond just the most impacted regions.

[Figure: the same comparison with the axis capped at 10 TPM]

The kernel density estimator plot quantifies the overall reduction of incompleteness in a genome-wide manner between GRCh37 and GRCh38 (again note the difference in TPM scales):

[Figure: kernel density estimates of TPM values for GRCh37 and GRCh38]

We presume that, as the human reference genome continues to improve, these artifacts in the Salmon quantification will decrease. It should also be possible to use this measure to assess regions that are incomplete in non-human references with varying degrees of polish, and to automatically estimate the approximate sizes of gaps.

There is additional complexity not discussed here: the current reference is a “monoploid” representation and so inherently departs from any of us as diploid humans. This limitation may be addressed over time as more high-quality references representing diverse ancestries become available (any two of which could be combined for the purposes of simulating a diploid).

Conclusion

We hope that we have provided insight into some of the differences that may arise when using simulated data, particularly FASTQs. We expect this effect to be relatively strong with regard to CNV calling. We think this phenomenon will likely affect the use of simulated data for SNP and Indel calling to a lesser extent, with a greater impact felt in regions of low sequence complexity and those near reference gaps.

We hope that this method could be promising for assessing reference genomes, and we hope this has given some insight into potential improvements in GRCh38.

Although these issues have made it difficult to build Sashimi as a full, genome-wide CNV caller, we believe that the approach will be viable for detecting large deletions, for modeling CNVs in a gene-by-gene manner, and for confirming events. We hope to detail Sashimi for this use case in a later post.

Final Note About the Authors:


DNAnexus has teams of software engineers who build and maintain the infrastructure, web GUI, security systems, and developer tools that power the DNAnexus Platform. We also have teams of bioinformaticians who develop applications on DNAnexus, help customers deploy their workflows, and represent DNAnexus in the scientific community.

DNAnexus has a commitment to the development and growth of our employees and to connecting with the scientific community. As a part of that, both scientists and engineers have the option to take 10% of their time (currently Friday afternoon) to work collaboratively on projects.

The authors of this work, Aleksandra (Ola) Zalcman and Calvin Bao, are software engineers at DNAnexus. The figures and analysis you see here are a small fraction of a body of work that contains the various ideas, pivots, false starts, and side investigations that come with developing a new method. Ola, as the lead author of this work, has shown great dedication in continuing to push the work forward, experiment by experiment, every Friday.

This work speaks to the fact that science isn’t something you need a specific pedigree or credential to do – science only requires you to bring curiosity, collaboration, rigor, and effort. Being able to work in an environment full of such sharp, collaborative people with different backgrounds and expertise is one of the best things about working at DNAnexus.

We thank Deanna Church of 10X Genomics and John Didion of DNAnexus for helpful comments, Samantha Zarate for technical editing, and the authors of Salmon: Rob Patro, Geet Duggal, Michael Love, Rafael Irizarry, and Carl Kingsford.

Methods and Commands:

A Jupyter notebook with the analysis and data from this blog can be downloaded at this link (note that this is a 500 MB download).

The applets and index files are available in this public DNAnexus project (anyone with an account can access this project) from the link above.

The commands and parameters used to generate the reads, run Salmon, and simulate the events:

Generate Salmon Index

Split the reference fasta into 500bp contigs. Index with:

salmon index -t GRCh38.no_alt_analysis_set.all.500base_chunk.fa -i GRCh38.no_alt_analysis_set.all.500base_chunk_k31_index -p 32 --type quasi -k 31
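
The splitting step above isn't shown as a command. One minimal way to do it (a sketch using only the Python standard library; the input/output file names and the chunk naming scheme are placeholder assumptions, not necessarily what we used):

# split_fasta.py: split each reference sequence into consecutive 500bp chunks,
# writing one FASTA record per chunk named <contig>:<start>-<end>.
import sys

CHUNK = 500

def read_fasta(path):
    name, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
    if name is not None:
        yield name, "".join(seq)

if __name__ == "__main__":
    ref, out = sys.argv[1], sys.argv[2]
    with open(out, "w") as out_fh:
        for name, seq in read_fasta(ref):
            for start in range(0, len(seq), CHUNK):
                chunk = seq[start:start + CHUNK]
                out_fh.write(f">{name}:{start + 1}-{start + len(chunk)}\n{chunk}\n")

Run as, for example, python split_fasta.py reference.fa reference.500base_chunk.fa before building the Salmon index.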

Simulate Reads with Mason:

mason_simulator -ir ${REF} --seed ${SEED} -o mason.R1.fastq -or mason.R2.fastq --illumina-read-length 150 -n ${READ_NUMBER} --read-name-prefix ${PREFIX} --illumina-prob-insert 0.00005 --illumina-prob-deletion 0.00005

Quantify with Salmon:

salmon quant -i GRCh38.no_alt_analysis_set.all.500base_chunk_k31_index -l IU -1 HG002NA2438550x_1.70_percent.fastq -2 HG002NA2438550x_2.70_percent.fastq -o HG002NA2438550x__quant -k 31 -p 32 --seqBias --numBiasSamples 1000000 --posBias --gcBias

Simulate CNV Events with SURVIVOR:

SURVIVOR 1 ${REF} parameters.txt 0 0 ${PREFIX}

The parameter file can be downloaded from the link above.

SMRT APAC Genome Assembly Grant & Upcoming Webinar

Submit your unique plant or animal genome proposal for a chance to win free de novo assembly services on PacBio SMRT Sequencing data. See details below.

De novo genome assembly is a complex task, which can require massive computational resources to weave long reads into a final, polished assembly. Plants and some species of animals can prove particularly challenging due to their high levels of genetic diversity, repetitive elements, and duplicated genomic regions. Bringing together multiple technologies (long-read, short-read, next-generation mapping) can improve contigs and scaffolding to provide more complete and accurate genome assemblies.

At DNAnexus we have a team of scientists that provides fast, accurate, and cost-efficient reference-quality assembly services. Our key experience includes the 3,000 Rice Genomes Project and the Vertebrate Genomes Project, along with many complex individual assemblies, including the tobacco genome (4.5 billion base pairs, tetraploid, and highly repetitive) and the immune recognition regions of the Rhesus macaque.

APAC Genome Webinar

Join us, in collaboration with our partners PacBio and Microsoft, for a webinar, Best Practices for Rapid Reference-Quality Genome Assembly, which takes place September 5th at an APAC-friendly time of 10:00 AM (China Standard Time). Register today.

In this webinar, you’ll learn:

  • Why de novo assembly is important
  • How we define a reference quality genome
  • Best practices with PacBio’s FALCON, FALCON-Unzip, & NHGRI’s Canu assemblers
  • Layering next-gen mapping for more accurate assemblies
  • Case study: Rhesus macaque
  • Benefits of DNAnexus assembly services

For the webinar case study, the La Jolla Institute for Allergy and Immunology and the University of Louisville used DNAnexus de novo assembly services to address the research community's need for a higher-quality R. macaque reference genome. The R. macaque genome provides a good reference for use in vaccine trials to understand both efficacy and antibody response. Examining this response proved challenging, though, as the research community lacked a full picture of the genome. Although primate genome assembly is not difficult in general, the DNA regions involved in immune recognition are, because of their highly repetitive content, segmental duplication, and haplotype variation between individuals. DNAnexus scientists assembled 2.8 billion base pairs of the genome, with a contig N50 of 8.4 million base pairs and a haplotype contig N50 of 451,000 base pairs.

SMRT APAC GRANT

DNAnexus is excited to announce the SMRT APAC Grant powered by DNAnexus, through which we are offering free de novo assembly for the most unique plant or animal genome in the world. One lucky winner will be selected for DNAnexus de novo assembly services on PacBio SMRT Sequencing data. To enter, simply submit a 250-word proposal that includes a short description of your particular plant or animal genome (up to 1.5 Gbp in size) and how its sequencing and de novo assembly would benefit the larger scientific community. We are now accepting submissions on the SMRT APAC Grant website. Please note that de novo assembly services will only be applied to data generated through PacBio SMRT Sequencing, and sequencing itself is not included in the SMRT APAC Grant.

The deadline for submission is September 30th, with the winner announced the week of October 8th.

Do you have a genome that needs to be assembled? Learn more about our de novo assembly services or request a quote.

Breaking Down Crumble: A New Method to Significantly Reduce NGS Data Footprint while Preserving Results

[Photo: blog authors Alessandro and Andrew]

Crumble is a newly published NGS compression method by James Bonfield, Shane McCarthy, and Richard Durbin of the Sanger Institute. In this post, we demonstrate that Crumble provides size savings of 20-70% for BAM and CRAM files from HiSeq2500, HiSeqX, and NovaSeq Whole Genome Sequencing (WGS) and exomes. We profile the resulting variant calls, showing generally similar, and in some cases slightly improved, variant-calling accuracy on the same sequencing data.

Introduction

Among computational fields, genomics is especially data-heavy. For a 30x WGS sample, the size of the BAM file can easily exceed 100 GB. The collective size of genomics projects is growing at a tremendous rate. One highly cited work by Stephens et al. projects that, with current trends, the volume of NGS data in 2025 will exceed the data footprint of YouTube, Twitter, and astronomy.

Most of the data footprint is in BAM files. In this format, the data footprint is divided between Read Names, the Sequence itself, Quality Values (QV) and Tags. The CRAM format uses lossless reference compression for the sequence, pushing the relative data footprint toward storing the quality values (see our Readshift blog for background on QV).

[Figure: uncompressed quality values, each occupying a fixed amount of space]

In an uncompressed (SAM) file, each quality value occupies a fixed amount of space (see the figure on the right, where each quality value is represented as a different shape). The BAM format compresses this data with a standard method similar to gzip. To simplify greatly, compression methods identify recurring patterns within the data that can be used to pack elements together. More irregular data (data with higher entropy) is harder to compress. This blog from Dropbox contains interesting examples of how compression methods work.

As Illumina has progressively increased instrument throughput, it has fought the growing friction of managing data footprint by decreasing the range of possible QVs (HiSeq2500: 40, HiSeqX: 8, NovaSeq: 4). The figures below illustrate how packing a stream of QV pieces together is easier when there are fewer types of QVs and these types are more similar. These changes are generally synergistic with other advanced compression technologies such as Petagene.

[Figure: compression of quality value streams with many vs. few distinct values]
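
One quick way to see this effect is to compress synthetic quality strings drawn from alphabets of different sizes. This is a toy demonstration rather than real sequencer output, with zlib standing in for BAM's gzip-style compression:

import random, zlib

random.seed(0)
n = 1_000_000

# Quality strings drawn uniformly from alphabets of different sizes,
# roughly mimicking HiSeq2500 (~40 values), HiSeqX (8), and NovaSeq (4).
for levels in (40, 8, 4):
    alphabet = [chr(33 + q) for q in range(levels)]
    qv_stream = "".join(random.choice(alphabet) for _ in range(n)).encode()
    compressed = zlib.compress(qv_stream, 6)
    print(f"{levels:>2} distinct QVs: {len(compressed) / n:.2f} bytes per base")

Real QV streams are not uniformly random, so the absolute numbers differ, but the trend is the same: fewer distinct values means fewer bytes per base.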

The Crumble paper calls this strategy “horizontal” because it applies generally to all bases. Crumble additionally applies a “vertical” approach to consider the context of surrounding reads, preserving more quality values in regions of lower coverage or in signals of interest (like variant positions) and compressing more in regions where reducing entropy is unlikely to impact results. The figure below attempts to conceptually represent some of the conditions under which Crumble bins more or less aggressively.

[Figure: conditions under which Crumble bins quality values more or less aggressively]

Results: Crumble Saves Significant Space for Both Whole Genomes and Exomes

To assess Crumble, we first applied it to several WGS and exome BAM and CRAM files produced across different machines. As Figure 1 illustrates, Crumble saves significant space across all types of WGS files, with larger savings for HiSeqX than for NovaSeq. In these charts, Crumble is run at its default settings. Lower numbers are better (i.e., a BAM/CRAM that takes less space).

[Figures: BAM and CRAM file sizes before and after Crumble for WGS and exome samples]

The average savings in size relative to the starting state are:

Sample Type      Percent Size Savings (BAM)    Percent Size Savings (CRAM)
WGS HiSeqX       55%                           70%
WGS NovaSeq      20%                           34%
Exome            47%                           60%

Applying Crumble Improves Variant Calling Results in Almost All Conditions

Crumble's intelligent application of binning allows it to minimize possible negative consequences. Interestingly, Crumble has a positive impact on variant calling results overall. We suppose this occurs because Crumble bins more aggressively in regions of unusually high coverage or poor mapping. Though QVs indicate the probability that a given base is in error, accounting for the additional probability that a read is mismapped is a more complex calculation about which variant callers are likely overconfident. Crumble's binning may correct for this by reducing unfounded confidence.

Crumble improved accuracy on NovaSeq WGS for almost every pipeline, with substantial decreases in errors seen for several. In all charts that follow, lower values on the y-axis (that is, fewer errors) are better.

[Figure: variant-calling errors on NovaSeq WGS with and without Crumble]

For exomes, the changes in accuracy were less pronounced, with only GATK showing a substantial difference, an improvement of about 10%.

[Figure: variant-calling errors on exomes with and without Crumble]

The following table summarizes how Crumble impacts variant calling across the use cases. Here, larger negative numbers indicate a larger improvement from Crumble. Crumble improves results in 73% of investigated cases, with substantial (~10% or more) improvements in 33% of cases. In those cases where Crumble made pipelines worse, the impact was small. GATK4, in particular, was consistently helped by Crumble.

Change in total errors by using Crumble v0.8.1 relative to baseline (non-Crumble).

               Strelka2    DeepVariant    Freebayes    Sentieon    GATK4
HiSeqX WGS     -1.3%       -19.2%         -5.2%        +3.1%       -27.1%
NovaSeq WGS    -1.2%       +3.5%          -2.3%        -10.3%      -22.6%
Exome          -3.0%       +1.4%          +1.3%        +0.1%       -9.9%

The Crumble Level Parameter is More Important for Exomes; Level 9 (the Default) Behaves Differently for WGS and Exomes

Crumble comes with presets for parameter combinations that correspond to increasingly aggressive binning strategies (Levels 1, 3, 5, 7, 8, and 9, with 9 as the default). We applied each of these settings to 4 WGS samples (HiSeqX HG001, HG002, and HG005, and NovaSeq HG001) and 3 exomes. In every case, our results were consistent between WGS samples and between exome samples, so we present results from only one representative of each.

Our initial investigation of levels used Crumble v0.8. Version 0.8.1 was released two weeks later, adding a new Level 9 and renaming the old Level 9 to Level 8, but not otherwise changing settings, so we have only re-run the new modes in these charts.

[Figures: output size and Sentieon variant-calling accuracy across Crumble levels for WGS and exome samples]

Crumble Is Fast and Cheap to Run

By running Crumble, a user chooses to pay a compute cost in order to achieve persistent savings in storage. This cost-benefit decision depends on the speed of the method. Fortunately, Crumble is extremely fast, especially when compared to methods for mapping samples or calling variants. We are generally able to run Crumble on a 35X WGS in just 8 CPU-hours.

Unfortunately, Crumble does not currently support multi-threading. However, users can restrict it to running on a genomic region. So, to run Crumble quickly and efficiently, we do the following:

  • Generate a contig list from the BAM header,
  • set the unplaced reads aside,
  • use gnu-parallel to apply Crumble to each chromosome in parallel, and
  • use samtools cat to combine everything.

(the following image is a code snippet from our app):

[Figure: code snippet from the Crumble app]
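
As a rough sketch of the same steps (not the actual app code; this swaps Python's concurrent.futures for gnu-parallel, assumes an indexed input BAM with crumble and samtools on the PATH, and uses the simple "crumble in.bam out.bam" invocation, which may need adjusting):

import subprocess
from concurrent.futures import ProcessPoolExecutor

IN_BAM = "input.bam"            # placeholder paths
OUT_BAM = "output.crumble.bam"

def contigs(bam):
    # List contig names from the BAM header via samtools idxstats.
    out = subprocess.run(["samtools", "idxstats", bam],
                         capture_output=True, text=True, check=True).stdout
    return [line.split("\t")[0] for line in out.splitlines()
            if line and not line.startswith("*")]

def crumble_contig(contig):
    # Extract one contig and run Crumble on just that piece.
    piece = f"{contig}.bam"
    crumbled = f"{contig}.crumble.bam"
    with open(piece, "wb") as fh:
        subprocess.run(["samtools", "view", "-b", IN_BAM, contig],
                       stdout=fh, check=True)
    subprocess.run(["crumble", piece, crumbled], check=True)
    return crumbled

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        pieces = list(pool.map(crumble_contig, contigs(IN_BAM)))
    # Unplaced reads would also need to be set aside and appended; omitted here.
    subprocess.run(["samtools", "cat", "-o", OUT_BAM, *pieces], check=True)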

With this, we can execute Crumble in about 1 hour of wall-clock time on an 8-core machine (downloading to and uploading from the cloud worker both take a bit of additional time).

The break-even point at which the storage savings justify the extra compute cost depends on your relative costs of each. However, under most realistic scenarios, running Crumble will pay back the investment within a few weeks to months and will accrue lasting savings well beyond that. There are other benefits to reducing data size: it is easier for collaborators to access, and it takes less time to move into and out of cloud workers from cloud storage.

Crumble on DNAnexus

An efficiently wrapped app to run Crumble is available to all users in the DNAnexus app library at https://platform.dnanexus.com/app/crumble (platform login required to access this link). This app can accept input and produce output in either BAM or CRAM format (CRAM output requires a matching reference genome).

[Figures: Crumble workflow and the Crumble app on the DNAnexus Platform]

Final Notes

These tests are all performed on germline sequencing. It is unknown how Crumble would interact with somatic methods. There are other operations which can shrink BAM/CRAM files further without impact, such as lossy compression of read names. We hope to explore these in a later post.

Conclusion

We hope we have demonstrated that Crumble is a useful tool capable of achieving significant space savings on diverse data. Under most conditions, Crumble has a negligible or positive impact on downstream processes. It can be run at Level 7 to effectively compress both exomes and WGS. Crumble's speed means that running it is favorable from a cost-benefit perspective.

As genomics continues to grow in scale, we will need methods such as Crumble to ensure that, as a field, we have solutions which can scale to future needs. The value of genomics should be judged not by the magnitude of its data footprint, but by the value the field brings to society.

We thank the authors of Crumble, James Bonfield, Shane McCarthy, and Richard Durbin of the Sanger Institute, for their hard work in making this tool available to the community, and for providing helpful comments and suggestions on this blog.