Comparison of BGISEQ 500 to Illumina NovaSeq Data

Andrew Carroll




Last Thursday, BGI uploaded three WGS data sets of NA12878/HG001. Included in this was a challenge to conduct a side-by-side analysis.

BGISEQ TweetAlbert Villela of Cambridge Epigenetix drew our attention to this dataset. Given our love of benchmarking and new technologies, we applied our evaluation frameworks to this data. The technology behind the BGI-SEQ 500 was based on Complete Genomics, which was acquired by BGI.

This uses DNA Nanoballs and a probe-based method of sequencing. We have seen data from this instrument in PE50, PE100, and PE150 formats, with the number indicating the length of paired-end reads (longer reads generally lead to better analysis).

A dataset for HG001 was previously submitted to Genome in a Bottle last year, with PE50 and PE100 reads and DNAnexus conducted an assessment of these data. This analysis indicated a reasonable gap, particularly in Indels. We also demonstrated that by re-training deep learning models on BGISEQ data (as Jason Chin did with Clairvoyante and Pi-Chuan Chang with DeepVariant) it is possible to bridge this gap using only software.

Given that this new dataset is released one year later and uses longer reads, the new set provides a measure of the progress BGI has made on the instrument and the difference in using longer reads. To quickly summarize, this most recent release of BGISEQ demonstrates significant improvements relative to prior data, though a (modest) gap relative to Illumina remains.

Performance Comparisons

We downloaded the 3 WGS sets directly from EBI. All of these data were generated and submitted by BGI – 2 runs from the BGISEQ500 PE150, and one run from NovaSeq6000 presumably operated by BGI. We analyzed each WGS set in its entirety through several standard pipelines – DeepVariant, Sentieon, Strelka2, GATK4, and Freebayes. Mapping was performed by Sentieon, which produces identical output to BWA-MEM and is faster.

Because the Illumina data here was submitted by BGI, we thought it fair to also include Illumina data generated by Illumina, so we used 35X WGS NovaSeq data available from BaseSpace that we have previously included in our Readshift blog.

BGISEQ ComparisonFigure 1 shows the SNP accuracy (combining false positives and false negatives). Lower bars in this chart indicate better performance. Some take-aways:

  1. The gap between Illumina NovaSeq and BGISEQ is quite narrow in these data. The difference between BGISEQ and Illumina BaseSpace for GATK is about the same as the difference in pipeline choice on Illumina data between GATK and DeepVariant. By this logic, if groups are OK in terms of accuracy choosing GATK, they should be OK to consider BGISEQ as well.
  2. The accuracy of the Illumina data available on BaseSpace is greater than the accuracy for the set uploaded by BGI. For SNPs, Illumina’s BaseSpace set is more representative of what we see in the community. The BGI set seems to be a somewhat worse sequencing run. The yellow illumina set is probably the fairer comparison.
  3. Note that the version of DeepVariant is NOT the one tuned with BGISEQ data ( we did not run in the BGISEQ tuned model for this investigation). It would be interesting to see whether performance will improve further on that model.

BGISEQ Comparison 2Figure 2 shows the Indel accuracy. All of the observations made previously apply with the addition of the following:

Based on the error profile of the callers here, the Illumina dataset submitted by BGI is clearly a PCR-positive dataset (based on the much stronger performance of Strelka2 and DeepVariant). Also, the impact of PCR on these data is slightly more pronounced than with other PCR-positive datasets we see.

Presumably, all available BGISEQ options are PCR-positive. However, Illumina provides the option of both PCR-free and PCR-positive preparations. Given this, it seems more fair to recommend a comparison relative to the yellow Illumina-operated NovaSeq. In addition, this is a good time to remind readers that they should be aware of the impact of PCR in sequencing quality and whether their datasets were generated with it.

Breakdown of False Positives and False Negatives

When we do these benchmarks, the most frequent request is to break down false positives and false negatives on the datasets. Figures 3 and 4 show this for SNP and Indels in one of the BGISEQ samples (which is representative of the other as well):

BGISEQ SNP Comparison

BGISEQ SNP Comparison2Computational Performance 

Finally, you may wonder if any of the programs have issues running on BGISEQ data (or run longer). The answer is – not really. Computationally, performance seems similar to what we see in Illumina data:

BGISEQ Core Hours Comparison

Conclusion –

If this newest data is broadly representative of BGISEQ performance, the BGISEQ looks like a technology worth considering. The price points that we have heard second-hand indicate that buyers would consider tradeoffs between less widely adopted and (slightly) less accurate BGISEQ genomes in exchange for better economics. Based on these benchmarks, these differences in accuracy are not so extreme that BGISEQ genomes would be considered very different.

It is important to note that these samples are PE150, the PE50 and PE100 may have worse performance. It is also important to note that these were specifically put out by BGI and likely represent the highest quality runs from the instrument.

Given that it is still early in the lifecycle of the instrument, it will be important to rigorously QC runs until the community has a good feel for the consistency of BGISEQ quality. If anyone else has runs of HG001/HG002/HG005 from the BGISEQ, I would love you to reach out to us ( so we can replicate this analysis from community-driven runs.

Amplifying Google’s DeepVariant

In this blog we quantify Google Brain’s recent improvements to DeepVariant – detailing significant improvement in both exomes and PCR genomes. We reflect on how this improvement  was achieved and what it suggests for deep learning within bioinformatics.


The Google Brain Team released DeepVariant as an open-source GitHub repository in December 2017. Our initial evaluation of DeepVariant and our Readshift evaluation (blog) method (code) identified that while DeepVariant had the highest accuracy of all methods on the majority of samples, there were a few outliers with much higher Indel error rates.

Subsequently, we realized that these difficult samples had a common feature – they were all prepared with PCR. Also, the initial release of DeepVariant had not been trained on exome data. Based on this, DNAnexus provided exome and Garvan Institute PCR WGS samples to Google Brain to train improved DeepVariant models, which are now released. We also recognize Brad Chapman, who independently collaborated with Google Brain.

Improving Deep Learning Methods Compared to Improving Traditional Methods

Many factors make variant calling more complex in PCR+ samples and exomes. PCR amplification is biased by GC-content and other factors. Errors that accumulate in the PCR process are difficult to distinguish from true variants. Exomes have an additional complexity of uneven capture efficiency and coverage.

Knowledge of these facts is incorporated differently in human-programmed methods, traditional machine learning, and deep learning methods. Human-written methods require a programmer to carefully develop heuristics which capture the relevant properties without too much rigidity. Traditional machine learning approaches require a scientist to identify informative features that capture the information, encode those from the raw data, and train and apply models.

Deep learning based methods like DeepVariant attempt to represent the underlying data in as raw a manner as possible, relying on the deep learning models to learn the features. Here, the scientist’s role becomes to identify the examples that embody the diversity of conditions the tool must solve, and to organize how to present as many and varied of these examples to the training machinery as possible. How Google Brain made these improvements may be as interesting as the improvements themselves.

Exome-Trained Models Significantly Improve Performance

Google Brain’s initial release was DeepVariant v0.4. The v0.5 version added the ability to apply either a WGS model or an exome model. The v0.6 version specifically added PCR+ training for the first time.

All evaluations use the hs37d5 reference and v3.3.2 Truth Sets from NIST Genome in a Bottle. Evaluation is performed with in the same method used in PrecisionFDA.  Google’s release notes indicate they never train on HG002 or chr20 of any sample.

On exomes, DeepVariant improved from approximate parity to a 2-fold error reduction relative to GATK. There are two ways to put this improvement in accuracy into context. DeepVariant has an absolute decrease of 231 errors and 121 false positives. For a callset of 500,000 exomes – as the UK BioBank will soon make available – the fact that false calls tend to distribute randomly while true variants are shared in the population could mean a reduction of 60 million false variants positions in the full callset (for reference, the 60,000 exome ExAC set contained 7 million variant positions).

Another way to frame the value is: does the improved accuracy allow similar performance as traditional methods at a lower coverage?Figure 2 demonstrates that DeepVariant on the same exome downsampled randomly to 50% coverage is more accurate than other methods at full coverage. This could allow a re-evaluation of required sequencing depth to either increase throughput, save cost, or refocus sequencing depth on complex but important regions.

Including PCR+ WGS Data Greatly Improves DeepVariant Performance

DeepVariant was not specifically trained on PCR-prepared WGS samples until the v0.6 release, when 5 PCR+ genomes were added to the training set. Even before this addition, DeepVariant was the most accurate SNP caller surveyed in our PCR+ data. However, as Figure 3A indicates, the inclusion of PCR training data improved SNP accuracy noticeably.

Indels in these samples were a problem for almost all of the callers. Indel error rates were so much higher in these samples that this is the dominant error mode across every method. The exception was Strelka2, which performed vastly better. Illumina clearly put great care into modeling PCR errors when developing Strelka2. Edico was also able to use this observation to improve their DRAGEN method, as detailed in our recent blog.

This observation was key to realizing that the performance penalty was due to PCR. When facing a problem with additional noise and complexity, it can be unclear whether the lack of signal makes the problem fundamentally less solvable or if the problem remains similarly tractable, but instead requires more effort. Strelka2’s performance was proof that current methods could improve significantly.

As Figure 3B shows, the inclusion of PCR+ data has a huge impact on the error rate of DeepVariant, with a 3-fold error reduction. Simply identifying the right training examples was sufficient to go from worse than GATK to 10% better than the prior leader, Strelka2.

How Do New Training Data Impact Other Applications?

By including PCR+ and exome data in training, DeepVariant is now being asked to encode more information within its networks. Asking methods to multi-task in this way can lead to interesting effects and can force trade-offs which decrease overall performance.

An interesting phenomenon observed with deep learning methods is that challenging them with diverse, hard problems can sometimes improve general performance. Ryan Poplin, one of the authors of DeepVariant, discusses this phenomenon in a recent ML/AI Podcast on his work on an image classification method for diabetic retinopathy.

This could be understood in a few ways. First, harder problems may create pressure for network weights to identify subtle but meaningful connections in data. For example, a strong deep learning approach trained on novice chess opponents would only be so good. Another is that training examples that depend differentially on various aspects of a problem helps resist overfitting, allowing the model to discover and preserve subtle signals it could not otherwise. The result is a better generalist model.

There are many variables changed in the progression of DeepVariant from v0.4 to v0.6 – more overall training data has been added and tensors are changed – but if DeepVariant were to benefit from the complexity, one might hypothesize this would manifest in a more pronounced change on Indel performance, as that area is the most impacted in both the exome and PCR data. To test this, we ran each version on a “standard” benchmark – PCR-Free WGS HG002.

The new releases of DeepVariant show a trajectory of improvement, which is impressive given that DeepVariant led our benchmarks in accuracy even in its first release. SNP error rate is about 10% improved. The Indel error rate is improved more substantially, at a 40% reduction. Also, the greatest impact occurs with the addition of the PCR training data in v0.6. With this sort of uncontrolled experiment, we can’t conclusively say that this occurs because DeepVariant is learning something general about Indel errors from the PCR data. However, the prospect is enticing.

What Will Deep Learning Bring for the Field?

As deep learning methods begin to enter the domain of bioinformatics, it is natural to wonder how skills in the field will shift in response. Some may fear these methods will obsolete programming expertise or domain knowledge in genomics and bioinformatics. In theory, the raw methods to train DeepVariant on these new data types, confirm the improvement, and prevent regressions can be shockingly automatic.

However, to reach this point, domain experts had to understand the concepts of PCR and exome sequencing and identify their relevance to errors in the sequencing process. They had to understand where to get valid, labeled data of sufficient quantity and quality. They had to determine the best ways to represent the underlying data in a trainable format.

The importance of domain expertise – what is sequencing and what are its nuances – will only grow. How it manifests may shift from extracting features and hand-crafting weights to identifying examples that fully capture that nuance.

We also see these methods as enabling, not deprecating, programming expertise – these frameworks depend on large, complex training and evaluation infrastructures. The Google Brain team recently released Nucleus, a framework for training genomics models. Nucleus contains converters between genomics formats like BAM and VCF and TensorFlow. This may allow developers to tap into deep learning methods as specialized modules of a broader bioinformatics solution.

We hope that amongst the detail, this blog has communicated that deep learning is not a magic box. It remains essential for a scientist to carefully consider what the nuances of a problem are; how to rigorously evaluate performance; where blind spots in a method are; and which data will shine light on them.

Evaluating the Performance of NGS Pipelines on Noisy WGS Data

In this blog post, we explore variability in the quality of sequencing runs. We present a method which allows the generation of benchmarking sets of real sequence data across arbitrary quality ranges. Finally, we present benchmarks of popular variant calling pipelines across these diverse quality sets.


The recent generation of gold-standard genome datasets – such as the Genome in a Bottle (GIAB) and Platinum Genomes – has given greater understanding about the quality of sequencing. Competitions held on PrecisionFDA have driven the innovation and refinement of new methods. Sentieon, Edico, and DeepVariant, each a winner of one or more of Consistency, Truth, and Hidden Treasures Challenges, have proven the power of standards to drive innovation. These gold-standard datasets used to construct truth sets and challenges were produced with great care and attention to quality.

We wondered how directly these sets reflect common, real-world sequencing data. “Quality” of sequencing is complex – factors such as library complexity, coverage, contamination, index hopping, PCR errors, and insert size distribution are all important. For our exploration, we use a quantifiable, unambiguous, and straightforward measure – the sequencer-reported base quality.

FASTQ base qualities estimate the probability that a base call (A, T, G, or C) is in error. These errors make reads more difficult to correctly map and produce noise that variant callers must contend with. Monitoring of base qualities is often a component in QC.

We score each pair of forward and reverse reads with the expected value for the number of errors present in the pair. The combination of these values for each read pair in a sequencing run creates a distribution with a mean and standard deviation for errors.

Comparing GIAB, PrecisionFDA Sets to General Data

First, we investigate the Illumina HiSeq2500 contributed to GIAB. These data, along with contributions of PacBio, Complete, 10X, BioNano, Oxford Nanopore, and Ion Torrent were used to create the GIAB truth sets.  

As Figure 1 indicates, there is some variability in quality – HG003 samples have about 30% less of an error rate than HG001. However, overall base accuracy is high.

We selected 3 readily available HiSeq2000/HiSeq2500 sets from SRA and performed the same calculation.  As Figure 2 illustrates, other datasets range from modestly higher (~3x) to dramatically higher (~20x) error rates. The blue dots in this figure are the same as all of the dots in Figure 1.

We applied a similar investigation to datasets from the HiSeqX. In Figure 3, we compare predicted error rates for the 2016 Garvan WGS dataset contributed to the PrecisionFDA Truth challenge, several HiSeqX replicates produced by Illumina and available on BaseSpace, several HiSeqX runs performed by Academic Medical Centers, and the first publicly available 2014 HiSeqX run from Garvan that we noted as difficult for Indel calling in a prior blog post.

With these HiSeqX sets, there is a 5-fold variability in predicted error rate between the highest and the lowest samples.  How do differences of this magnitude relate to the quality of variant calls from a sample? Answering this question is not easy since the diversity of datasets are not on the genomes (HG001, HG002, or HG005) that have truth sets for. To address this, we have developed a method called Readshift, which allows us to resample existing runs to create real datasets of diverse read quality.


Readshift uses biased downsampling from a dataset of larger coverage to create a shifted quality distribution at lower coverage. (See Figure 4 for a conceptual example.)

It does so by modeling the original and final samples as normally distributed and calculating what the probability for inclusion of each read would be required to reach the desired distribution.

Readshift does not force the final distribution to be normally distributed. Rather, it acts as a transformation which allows shifting of a distribution in a way that preserves the shape of the original distribution.

Importantly, the data produced by Readshift is not simulated – these are real reads produced by a real sequencer and thus carry all of the intricacies, biases, and idiosyncrasies associated with real data. No individual base or quality value is altered.

All of the code for Readshift is released with an open source license on Github  – The FASTQ data in this investigation are available in a public DNAnexus project for download or general use, as are the Readshift apps.

Shifting HiSeq2500, HiSeqX, and NovaSeq Datasets

Several high-coverage datasets exist for HG001 on the HiSeq2500, HiSeqX, and NovaSeq platforms. We took 200X HiSeq2500 data submitted for GIAB, 200X HiSeqX data available on BaseSpace, and 350X Novaseq data available on BaseSpace and applied Readshift to increase the predicted errors in reads by 0.5, 1.0, 1.5, and 2.0 standard deviations relative to the prior mean. To confirm that no artifacts are introduced by the shift machinery, we also applied Readshift with a shift of +0.0 standard deviations.

This range roughly covers the difference between the best and worst HiSeqX sets shown above. The magnitude of the possible shift is determined by the amount of starting coverage – higher coverage allows larger shifts without clipping the distribution.

Figure 5 shows the distribution of read errors after the shift. Though we use the probability density function of the normal to do the transformation, shifting qualitatively preserves the distribution shape. Both original and final distributions are left-leaning curves. At the higher degrees of shift, much of the decline in read quality appears driven by reads with a strong decline in quality toward the 3’ end, or reads with a bad mate.

Evaluating WGS Pipelines on Noisy Datasets

We evaluate several popular WGS variant calling pipelines on this shifted data starting from 35X coverage FASTQ files. All analyses are done against the hg38 reference genome. Accuracy is assessed on the confident regions using the Genome in a Bottle Truth Sets, with the evaluation performed using the same method that we use in PrecisionFDA.

Figure 6 demonstrates that tools behave differently when used to call increasingly noisy data . Freebayes begins to suffer immediately, while Strelka2 suffers once the shift exceeds 0.5 standard deviations. Even the most resistant programs perform significantly worse (GATK 100%, DeepVariant 68% more errors) at a shift of 2.0 standard deviations. Sentieon avoids downsampling to achieve reproducibility, but otherwise matches GATK3.

The SNP results for HiSeqX follow a similar pattern as indicated in Figure 7.

For NovaSeq (Figure 8), though the data is qualitatively similar to the HiSeq2500 and HiSeqX runs, other complexities occur. Generally, low-quality runs took significantly longer to finish for this data. Some programs either failed or produced results that suggest they did not call several regions for high coverage. The NovaSeq uses 2-color chemistry which does not enable it to indicate an unknown base as N. This may make reads with more errors hard to uniquely map, causing an increase in secondary alignments and regions of unmanageable coverage.

This suggests that filtering of NovaSeq data may be more important than with previous sequencers in low-quality runs. (GATK3+1.0, GATK4+1.0, Strelka+1.5, and DeepVariant+2.0 were impacted).


The results with Indel calling was different than those for SNPs. Apart from Freebayes, none of the variant callers suffer an accuracy penalty on the lower-quality data.

In the HiSeqX data (Figure 10), the Indel error rate is much higher even in the original data. This is the same phenomenon we observed in the early 2014 Garvan HiSeqX run we found in a prior blog post; however, these are different runs. It is possible that many early HiSeqX runs have this issue, or that some other aspect of the preparation is at fault.

Interestingly, though Indel callers arerobust against these extreme shifts in read quality, they suffer strongly from whatever occurred in the HiSeqX runs.

The NovaSeq data (Figure 11) is qualitatively similar to the HiSeq2500 results.

The fact that the callers exhibit greater robustness against Indel errors could be an opportunity to improve SNP calling. Although the fact that the definition of base quality values map more directly to SNP events than Indels, reads in the higher shifts often have long stretches of low-quality bases toward the 3’ end. Based on this, it seems likely that something in the way tools handle indels mitigates the low-quality reads. This could be a heavier reliance on haplotype construction, Indel realignment, or local assembly to resolve Indels. It could also be that since Indels span multiple bases, there is a greater signal to distinguish random errors from meaningful differences. If this is the case, this ability to gain nearby corroborating evidence might be achievable by phasing nearby variants or looking whether nearby potential errors co-vary.  

Increased Computational Burden on Low Quality Sets

As Figure 12 and Figure 13 indicate, lower quality samples carry a higher computational burden. For BWA, reads become harder to uniquely map, which also causes an increase in secondary alignments.

This problem is more severe for NovaSeq data, perhaps because 2-color chemistry lacks a way to designate a base as “N” (“no color” corresponds to “G”) for NovaSeq.  

For GATK3 HaplotypeCaller, this can cause samples to require 3x as much compute, even at moderate shifts. We believe that GATK3 is affected because secondary align reads pile up in poly-G/poly-C regions of the genome, and this generates a large number of candidate haplotypes to consider.

We think that many programs may suffer similar impacts, particularly any process which considers across candidate events at low confidence (e.g. many possible haplotypes) and selects the best choices. This may be worse for methods in somatic calling, or detection of events in cell-free DNA. Even in pristine reads with a 3/1000 error rate, any given mismatch is 3x more likely to be a sequencing error than the human variation rate. Germline callers manage this because the signal from germline events is consistent. Somatic callers must consider candidates at low frequencies or risk losing rare, but impactful, events. We hope to discuss this and strategies for mitigation in a future post. Note that the strong majority of NovaSeq runs are just fine – and standard read filtering tools seem to work for those that have issues.


By providing this analysis, we hope to highlight the fact that not all sequencing runs are the same and that this is important to their analysis, the choice of tools for them, and how we interpret tool evaluations on gold standard data. By releasing both the code  to perform the shift and the data (, we aim to help the community improve the accuracy and robustness of their methods. Readshift can be applied to any emerging benchmark as long as high coverage sequencing exists, such as the CHM1-CHM13 dataset developed by Heng Li et al.

This work raises a number of additional areas we have begun to investigate, including the best strategies to filter reads to protect against low-quality sets, the differences in impact on precision versus recall across tools and sequencing platforms, and the potential for ensembling analytical methods to shield against outlier sequencing runs.

If you are interested in collaborating with us on this work, we want to engage scientific partners. If you have sequence datasets on NA12878/HG001, HG002, or HG005, particularly from runs that turned out to be bad, and want to help us use them to make better benchmarking resources: Readshift only gets stronger with more data. Please reach out to us at