Considering an IT Investment For Your Clinical Diagnostics Lab? Help is Here

Next-generation sequencing data analysis management systems are the backbone of clinical diagnostics labs, providing vital functionality for day-to-day workings and business delivery. As a result, decisions about the tools to invest in can have far-reaching implications. 

Would it be best to improve, augment or entirely replace your existing systems? 

Which is the most resource-intensive? 

The most cost-effective? 


These aren’t the only questions.

Other considerations include scalability and future expansion potential, flexibility, and impacts on operations and turnaround time. 

It can be overwhelming. Luckily, we’re here to help, with two new guides to help you assess your current and future needs, and how to achieve them. 

Build vs. Buy: Six Questions to Consider When Investigating Clinical Diagnostics Informatics Solutions

Feeling like a victim of your own success? As testing demand increases, so does complexity, and IT issues that were once hidden are probably starting to surface. Experiencing unplanned downtime, difficulty deploying new pipelines, and challenges in finding and accessing data? These could bottleneck future growth.

Sounds like you need a reliable system that is prepared to grow alongside you. Custom-built Cloud solutions provide scalability, security, and lots of other benefits, which we outline in this Whitepaper. We’ll walk you through the decision-making process, and point out the advantages and disadvantages of the different IT solutions.

10 Tips to Scale Your Diagnostics Business & Grow Your Test Portfolio Globally

Ready to take your business to the next level? Great! If that involves expanding your footprint, either locally or globally, things could get complicated, especially when it comes to data sharing, storage and access. Are you prepared to navigate all the data sovereignty requirements and IP protection across locations? Do you know how to control access to pipeline algorithms and sensitive health data, and to comply with regionally-specific regulations?

By bringing all production pipelines into a single, unified cloud environment, with version-controlled pipeline updates rolled out simultaneously across your locales, you can effectively manage growth. This Whitepaper lays it all out, and explains why letting experts do the heavy lifting can pay dividends.

Still not convinced? Perhaps this handsome devil will help you decide:

Examining Variation from Wet-Lab Protocol Choices in Microbiome Data through the Mosaic Standards Challenge

Sam Westreich

The Need for the Mosaic Standards Challenge

Study of the human gut microbiome – the collection of bacteria that dwell in the lower intestinal tract of every person – is a challenging task.  Given the sheer number of bacteria present, along with the diversity of species represented, analyzing this environment requires collecting huge amounts of sequencing data in order to build an accurate profile of microbial composition.

Because of the complexity of this environment, it’s important to control all sources of potential variation due to experimental design.  Many researchers focus on making sure that they use a consistent bioinformatics pipeline, but this isn’t the only source of methods-based variation – almost every choice in the experiment, from the extraction kit used to the sequencing method chosen, has the potential to skew the results of a microbiome examination.

To examine these sources of variation and better determine the magnitude of their effects, DNAnexus partnered with the Janssen Human Microbiome Institute (JHMI), the National Institute of Standards and Technology (NIST), and the BioCollective to launch the Standards challenge.  We offered a sample kit to any participant who joined the challenge.  In return, each participant agreed to extract and sequence the provided samples, and to supply Mosaic with the resulting FASTQ files and the details of their processing, extraction, and sequencing protocols.

The Standards Challenge Sample Kit

On the Mosaic platform, we provide information about the Standards challenge, as well as a link to join the challenge.  Once a participant joins the challenge, they can place a free online “order” for a kit.  Each participant is allowed up to 3 kits, so they can compare and contrast their results using different wet-lab protocols.

Each kit contains seven samples.  Five are fecal samples, provided by the BioCollective, with each sample derived from a different human donor.  The kit also contains two purified DNA samples, provided by NIST, which contain DNA from known bacteria at predetermined relative concentrations.  The BioCollective and NIST constructed 700 of these kits, which are shipped for free to any participant around the globe.

Gathering Protocol Metadata

Metadata Submission Form

In order to properly determine what steps in the wet-lab sample processing, amplifying, and sequencing protocols have the greatest influence on variation in the sample results, we worked with several researchers and contract research organizations (CROs) to create a list of 99 questions about the protocols.

Answering 99 protocol questions for each sample is a lot of work, so we added several modifications and caveats.  First, in order to submit results, only two questions need to be answered – whether the submitted sample is paired-end (to know how many input files to expect per sample), and whether the sample was analyzed using 16S rRNA profiling, or if metagenomic sequencing was performed (to properly segregate samples).  If the participant doesn’t know the answer to a specific question, they are free to skip it without invalidating their submission.

However, we also want to incentivize participants to complete enough metadata questions to provide useful information to analyze.  All responses that are submitted to us are anonymized, allowing commercial groups to submit without fear of potentially tarnishing their brand.  We selected a subset of the protocol questions (29 questions) that we marked as “preferred”.  If a participant answered all of the preferred questions for a sample, we would reveal, to that participant only, the anonymized ID of that particular sample.  This encouraged participants to submit metadata answers for all preferred questions so that they could see where their individual samples fell in comparison to others.
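The submission rules above can be sketched as a small validity check. This is purely illustrative: the question keys, the five-question stand-in for the real 29 “preferred” questions, and the `check_submission` helper are all invented for this example, not Mosaic’s actual implementation.

```python
# Invented question keys; stand-ins for the real required/preferred sets.
REQUIRED = {"paired_end", "sequencing_type"}  # 16S vs. metagenomic
PREFERRED = REQUIRED | {"extraction_kit", "primer_set", "sequencer_model"}

def check_submission(answers):
    """Return (is_valid, reveal_anonymized_id) for one sample's metadata.

    Unanswered questions are simply skipped; only the two REQUIRED
    answers make a submission valid, and answering every PREFERRED
    question additionally unlocks the anonymized-ID reveal.
    """
    answered = {k for k, v in answers.items() if v not in (None, "")}
    is_valid = REQUIRED <= answered
    reveal = is_valid and PREFERRED <= answered
    return is_valid, reveal

# Minimal but valid submission; no ID reveal:
print(check_submission({"paired_end": True, "sequencing_type": "16S"}))
# → (True, False)

# All preferred questions answered; ID reveal earned:
print(check_submission({
    "paired_end": True, "sequencing_type": "WGS",
    "extraction_kit": "KitA", "primer_set": "V4", "sequencer_model": "MiSeq",
}))
# → (True, True)
```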

Analyzing the Results with a Consistent Pipeline

To analyze the submitted results, we worked with Dan Knights and Gabriel Al-Ghalith at the University of Minnesota, using their SHI7 and BURST tools to perform basic quality control and species-level profiling of the submitted FASTQ data.  Because we want to examine variation due to differences in wet-lab protocol, we made sure that the bioinformatics pipeline was identical for all samples.  This pipeline performed the following steps:

  1. Ran SHI7 (v0.9.9) to perform quality control and filtering on input files
  2. Removed human-aligned reads using Bowtie2
  3. Aligned against NCBI’s Representative Genomes collection (synced to RefSeq v82) using BURST (v0.99.7)
  4. Anonymized file headers and created a tarball of intermediate, anonymized results
  5. Used QIIME2 to convert files to BIOM format, merge all sample files for each of the 7 samples, and calculate Euclidean beta diversity and principal coordinates
  6. Merged the principal coordinates data with the collected protocol metadata to create an interactive figure using Emperor
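Step 5’s Euclidean beta diversity and principal coordinates can be sketched generically in NumPy. This is a textbook classical-MDS implementation assuming per-sample taxon abundance vectors, not the QIIME2 code the pipeline actually ran:

```python
import numpy as np

def euclidean_beta_diversity(abundances):
    """Pairwise Euclidean distances between per-sample abundance rows."""
    diff = abundances[:, None, :] - abundances[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pcoa(dist, n_axes=2):
    """Classical MDS: double-center the squared distance matrix, then
    scale the top eigenvectors by the square root of their eigenvalues."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j                # Gram matrix
    vals, vecs = np.linalg.eigh(b)                # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_axes]       # keep the largest axes
    vals, vecs = np.clip(vals[order], 0, None), vecs[:, order]
    return vecs * np.sqrt(vals)

# Three toy samples with non-overlapping taxon profiles:
counts = np.array([[10.0, 0, 0], [0, 10, 0], [0, 0, 10]])
coords = pcoa(euclidean_beta_diversity(counts))
print(coords.shape)  # (3, 2)
```

QIIME2 performs the equivalent computation at scale; the sketch just makes the double-centering step visible.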

The intermediate tarballs, each containing the raw, anonymized FASTQ reads and the BIOM results files for each sample within that tarball’s submission, are made publicly available in the May 2019 data freeze workspace on Mosaic.  The final Emperor plot may be viewed in the Results Exploration tab of the Mosaic Standards challenge page.

Exploring the Initial Standards Challenge Results

Examining the interactive Emperor plot reveals that, while the samples do segregate on the PCoA plot based on their sample ID, there is distinct variation seen among responses for each of the seven samples distributed in the sample kit.

Figure 1
Figure 1b

Figure 1: Examination of all WGS samples (left) and all 16S samples (right), colored by sample number within the kit.  Note that the yellow and cyan points correspond to the purified DNA samples, and are more diversely distributed than the five fecal samples.

The majority of variation seen among different samples on the initial Emperor plots appears to be due to differences between the results of the fecal samples (left, outside of circle) and the results of the purified DNA samples (right, inside yellow and cyan circles) supplied in each kit.  When we remove these DNA samples from the graph and focus solely on the fecal samples, we see more distinct clusters for each of the five fecal samples.  We speculate that the reason the DNA samples are more spread out is because they contain substantially fewer distinct taxa, allowing noise to play a more outsized role, as well as exacerbating the effect of individual misidentified taxa.

Figure 2 Figure 2b

Figure 2: Examination of the five fecal samples, looking at WGS results (left) and 16S results (right).  Color indicates the number of the sample, of the five fecal samples distributed in the kit. 

Interestingly, all beta diversity patterns show prominent clustering that is arranged in a linear pattern by sample. While each of the 5 stool samples, as well as the purified DNA samples, can be graphically distinguished, it is notable that the stool samples using WGS sequencing have the least jitter and form reasonably straight lines. We speculate this is due to the more limited reference database used for WGS profiling constraining variability in the references chosen, while the larger 16S database, in tandem with small amplicons, amplifies the effects of subtle sequence variations in influencing the reference taxon chosen. With 16S sequencing, there is also more opportunity for various biases, including different primers, PCR amplification strategies, and amplicon length to influence the outcome than with shotgun metagenomics, which is expected to produce more uniformly random coverage of genomes present.

The fact that PCoA essentially assigns a separate axis to each sample (especially in the WGS case) may be a promising sign, as it appears each sample is much more like itself than any other, allowing samples to be distinguished regardless of which lab produced them. However, the 16S samples do not share this quality to the same extent, as we see some evidence of certain samples appearing to impinge on the trajectories of others, depending on which lab produced them.

Further Analysis of the Available Data

16S rRNA Sequencing Data

For the 16S data, we used QIIME2 to further analyze the raw data, annotating against the QIIME2-provided “GreenGenes 99% OTUs full-length sequences” reference.

Examining the 16S data at the family level of annotation, we note that several microbial families appear unique to data from one or two particular individuals, allowing results files to be identified as having been derived from these particular samples.  For example, the Prevotellaceae family was most abundant in results derived from fecal sample 5 within the kits, and also present at a lower level in fecal sample number 4.
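Collapsing taxa to the family level, as in Figure 3, amounts to summing counts over the family field of each lineage string. A minimal sketch, assuming GreenGenes-style lineage keys (the lineages and counts below are invented):

```python
def family_of(lineage):
    """Pull the family name out of a k__;p__;c__;o__;f__;g__ lineage string."""
    fields = dict(part.split("__", 1) for part in lineage.split(";"))
    return fields.get("f", "") or "Unassigned"

def family_profile(taxon_counts):
    """Collapse {lineage: count} to {family: relative abundance}."""
    totals = {}
    for lineage, count in taxon_counts.items():
        fam = family_of(lineage)
        totals[fam] = totals.get(fam, 0) + count
    grand = sum(totals.values())
    return {fam: n / grand for fam, n in totals.items()}

# Invented counts for a sample dominated by Prevotellaceae:
sample5 = {
    "k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella": 600,
    "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Blautia": 400,
}
print(family_profile(sample5))
# {'Prevotellaceae': 0.6, 'Lachnospiraceae': 0.4}
```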

Figure 3

Figure 3: A QIIME2 visualization of the 16S taxonomy results for each submitted sample, sorted by sample number and colored at the family level.  Samples 1-5 are fecal, with 6-7 appearing distinctly different on the right side. 

Figure 4

Figure 4: The legend for Figure 3, providing annotations at the family level.  A link to the QIIME2 visualization file is provided below, and the data can be explored on QIIME2’s website.

Looking at the DNA samples, we observe that the Enterobacteriaceae family is the best distinguisher of whether a sample was from the A or B mock communities distributed as pre-extracted DNA.

We provide the QIIME2 visualization file here; it can be further explored, and sorted by any metadata question, using QIIME2’s online viewer.

Metagenomic Sequencing Data

We further analyzed the metagenomic sequencing data using Kraken to annotate the data against the standard Kraken database, which includes bacterial, archaeal, and viral genomes from RefSeq.  Kraken results were exported in MPA format so that they could be merged using MetaPhlAn2, which was also used for generating comparative heatmaps of the results.
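The merge step can be sketched in a few lines: each MPA-format report is one “clade&lt;TAB&gt;count” line per taxon, and merging them yields a taxa-by-samples table. The file contents below are invented; in practice MetaPhlAn2’s merge utility does this across all submissions:

```python
def parse_mpa(text):
    """Parse MPA-format lines ('d__...|p__X<TAB>count') into {clade: count}."""
    counts = {}
    for line in text.strip().splitlines():
        clade, count = line.split("\t")
        counts[clade] = int(count)
    return counts

def merge_tables(samples):
    """Merge {sample_name: {clade: count}} into sorted (clade, counts) rows."""
    clades = sorted({c for table in samples.values() for c in table})
    names = sorted(samples)
    rows = [(c, [samples[n].get(c, 0) for n in names]) for c in clades]
    return names, rows

mpa_a = "d__Bacteria|p__Firmicutes\t80\nd__Bacteria|p__Bacteroidetes\t20"
mpa_b = "d__Bacteria|p__Firmicutes\t50"
names, rows = merge_tables({"subA": parse_mpa(mpa_a), "subB": parse_mpa(mpa_b)})
print(names)  # ['subA', 'subB']
print(rows)
# [('d__Bacteria|p__Bacteroidetes', [20, 0]), ('d__Bacteria|p__Firmicutes', [80, 50])]
```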

Figure 5

Figure 5: A heatmap of all metagenomic sequencing data, generated by MetaPhlAn2.  Results largely cluster by sample number, with the fecal samples (1-5) on the left and the DNA samples (6-7) on the right.  

Examining a heatmap of all metagenome samples shows a clear distinction between the fecal samples (samples 1-5) and the purified DNA samples (samples 6-7).  Within the fecal samples, we see fairly clear separation of most of the samples, although some results for sample 1 are more similar to results for sample 4 than they are to each other.  Samples 2, 3, and 5 all segregate completely.  We additionally observe one outlier in sample 1, on the left side of the heatmap.  This result is highly divergent from all others, apparently because a truncated file was provided, with far fewer reads than the other files.

We see additional results when we generate the heatmaps for each individual sample, determining similarity of microbiome profiles by top twenty most abundant organisms.

WGS sample heatmaps

Figure 6: heatmaps produced using MetaPhlAn2 from Kraken results for metagenomic sequencing data for each individual sample.  The fecal samples (1-5) are more similar than the DNA samples (6-7), most likely due to the DNA samples containing material from fewer total species, and thus being more prone to distortion from mis-identification.

Here, we can see that there’s overall very little variation within each sample for the fecal samples, with more variation observed in the purified DNA samples.  This is to be expected, as the purified DNA samples contain DNA from fewer organisms and are thus more prone to mis-identification.  Looking at these individual sample heatmaps, we can also easily distinguish which samples were replicates from the same participant, using the same method; for example, submissions sub-cD69TaSgs3kiPLcjVTqzfkoQ, sub-E8p1NnXxrsEyXsS4dUZaPwvk, and sub-Fb2pkVAPdAR2383mVGL7xCU9 are all replicates.
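The replicate-spotting described above can be approximated with a simple similarity measure: restrict each profile to its top twenty organisms, then compare profiles with cosine similarity; near-identical profiles point to replicates from the same participant and method. The submission profiles below are invented:

```python
import math

def top_n(profile, n=20):
    """Keep only the n most abundant organisms in a {organism: count} dict."""
    keep = sorted(profile, key=profile.get, reverse=True)[:n]
    return {org: profile[org] for org in keep}

def cosine(a, b):
    """Cosine similarity between two sparse abundance profiles."""
    orgs = set(a) | set(b)
    dot = sum(a.get(o, 0) * b.get(o, 0) for o in orgs)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

rep1 = {"Bacteroides": 50, "Prevotella": 30, "Blautia": 20}
rep2 = {"Bacteroides": 52, "Prevotella": 29, "Blautia": 19}  # near replicate
other = {"Escherichia": 70, "Salmonella": 30}

print(round(cosine(top_n(rep1), top_n(rep2)), 3))   # close to 1.0
print(round(cosine(top_n(rep1), top_n(other)), 3))  # 0.0
```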

Summary and Next Steps

This initial examination of the Mosaic Standards challenge data has yielded several interesting insights from the submissions collected so far.  Overall, results have shown a high degree of consistency when produced by the same laboratory using an identical method, but we observe significant variation between results provided by different groups.  By comparing samples based on taxonomic similarity, we are able to cluster results by their sample of origin, more easily in metagenomic samples than in 16S results.

These current examinations of the data have compared different results, but haven’t yet attempted to draw ties between protocol choices in metadata and variation in taxonomic results.  Future investigations will focus on examining which protocol choices are correlated with the greatest levels of variation, connecting the metadata with the taxonomic results of analyzing the submitted data.

The Mosaic Standards challenge is still open for submissions – it’s still possible to join the challenge, receive a FREE sample kit, and submit your results.  As additional responses roll in from participants, we will continue to provide regular data freezes and additional analysis and insights from this evolving and multidimensional dataset.  Sign up to receive a free sample kit here:

DNAnexus Detectives: Using Amazon Web Services to Help Solve a Medical Mystery

Jason Chin, Chiao-Feng Lin, and Arkarachai Fungtammasan

Our mission, and we chose to accept it, was to join more than 100 researchers and engineers to look for answers and create insights into a real patient’s mystery medical condition.

In this case, that patient was John, aka “Undiagnosed-1”, a 33-year-old Caucasian male suffering from undiagnosed disease(s) with gastrointestinal symptoms that started from birth. In his 20’s, John’s GI issues became more severe as he began to have daily lower abdominal pain characterized by burning and nausea. He developed chronic vomiting, sometimes as often as 5 times per day. Now 5’10” tall and 109lbs, he is easily fatigued due to his limited muscle mass and low weight.

Armed with more than 350 pages of PDFs containing scanned images of John’s medical records plus a range of genetic data — from Invitae’s testing panel to whole genome shotgun sequencing from multiple technologies (Illumina, Oxford Nanopore and PacBio) — could we generate ideas for diagnosis, new treatment options or symptom management? Perhaps we could interpret variants of unknown significance or identify off-label therapeutics through mutational homogeneity.

This was the challenge set to us as part of a three-day event in June organized by SV.AI, a non-profit community designed to bring together bright minds from AI, machine learning, and biology backgrounds to solve real-world problems. This was its third event. Last year, we helped apply Google’s DeepVariant to a new kind of sequencing data for a rare kidney cancer case.

Off to a flying start

At DNAnexus, our main mission is to help our customers to process large amounts of genomic data with cloud computing. It is straightforward for us to do the initial processing of the genomic data. In this case, our customers were the event’s community of genomic “hackers.” We decided to pre-process John’s genomic data, so that our fellow participants would not have to spend extra effort to go through variant calling.

Chai generated SNP and structural variant calls before the event, and made the information available to everyone who might need it.

However, genomic data was only half of the picture. Clearly, the clinical information gleaned from John’s medical records would offer some clues. But how could we get through hundreds of pages of scanned images of his medical records (made available under an MIT license to invited participating scientists)? There had to be a smarter way to process the records, one that would allow us, and others, to write scripts and programs to process them.

Luckily, there is: Amazon Textract and Amazon Comprehend Medical.

Developed by Amazon Web Services (AWS) using modern machine learning technology, Amazon Textract is a service that automatically extracts text and data from scanned documents — think OCR on steroids. While there are many OCR software applications on the market these days, Textract provides more; it detects a document’s layout and the key elements on the page, understands the data relationships in any embedded form or table, and extracts everything with its context intact. This makes it possible for a developer to write code to further process the output data in text or JSON format to extract important information more efficiently. It is important to note that at the time of this blog post, Amazon Textract is not a HIPAA eligible service.  We were able to use it in this case because the patient data being analyzed was de-identified.  Amazon Textract should not be used to process documents with protected health information until it has achieved HIPAA eligibility.  Please check here to determine if Amazon Textract is HIPAA eligible.
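To give a feel for what “further process the output” looks like, here is a minimal sketch of pulling plain text out of a Textract response. The response below is a hand-made stand-in following Textract’s documented `Blocks` structure; a real one would come from the boto3 `textract` client (and, per the note above, only for de-identified documents):

```python
def extract_lines(response):
    """Collect the text of every LINE block from a Textract response, in order."""
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

# Hand-made stand-in; real responses come from e.g.
# boto3.client("textract").detect_document_text(Document={"Bytes": page_bytes})
fake_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Chief complaint: chronic vomiting"},
        {"BlockType": "WORD", "Text": "Chief"},
        {"BlockType": "LINE", "Text": "Medications: omeprazole 20 mg daily"},
    ]
}
print(extract_lines(fake_response))
# ['Chief complaint: chronic vomiting', 'Medications: omeprazole 20 mg daily']
```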

Another technology developed by AWS, Amazon Comprehend Medical, was used to process the output from Textract for John’s medical records. Amazon Comprehend Medical is a natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. Using Amazon Comprehend Medical, you can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records. Using it, we were able to extract medications, historical symptoms, and medical conditions from John’s doctors’ notes and testing/diagnostics reports.
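A sketch of turning that entity output into structured patient information: the response below mimics the shape of Comprehend Medical’s `detect_entities_v2` result (`{'Entities': [{'Text', 'Category', 'Score', ...}]}`), with invented values.

```python
from collections import defaultdict

def group_entities(response, min_score=0.5):
    """Bucket detected entities by category (MEDICATION, MEDICAL_CONDITION, ...),
    dropping low-confidence detections."""
    grouped = defaultdict(list)
    for ent in response["Entities"]:
        if ent.get("Score", 1.0) >= min_score:
            grouped[ent["Category"]].append(ent["Text"])
    return dict(grouped)

# Invented entities in the documented response shape:
fake = {"Entities": [
    {"Text": "omeprazole", "Category": "MEDICATION", "Score": 0.98},
    {"Text": "nausea", "Category": "MEDICAL_CONDITION", "Score": 0.95},
    {"Text": "abdominal pain", "Category": "MEDICAL_CONDITION", "Score": 0.91},
]}
print(group_entities(fake))
# {'MEDICATION': ['omeprazole'], 'MEDICAL_CONDITION': ['nausea', 'abdominal pain']}
```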

With more structured patient information than merely a collection of medical records as images, combined with the variant calls generated from the genomics data, the participants of the hackathon were able to jump right into solving the medical and genomic puzzles for John to help him and relieve his symptoms.

The results

We were happy that a couple of teams from the hackathon were able to use both the variant call set and the processed medical data that we provided.

Genomics Info Word Cloud

A word cloud generated from John’s medical record and genomics information by the “Too Big to Fail John” team.

  • The “Thrive” team used a similar approach to find potential variant candidates by identifying variants that are commonly seen with John’s medical conditions and predicted deleterious variant calls.

Thrive Variant Calling Schema

The schema of the Thrive team approach to analyzing both the medical records and variant call set.

  • The team Crucigrama extended the scope by incorporating other public ‘omics data, such as metabolic profiles, and by applying NLP to public genomics data.

Crucigrama Problem Statement

Team Crucigrama’s problem statement to extend the traditional approach to finding new leads for Undiagnosed-1.

  • The “Beyond Undiagnosed” team also utilized the medical record in their workflow so they could gather key symptoms and diagnoses fast, to provide future care recommendations according to their findings.

Beyond Undiagnosed Extracted Symptoms

The “Beyond Undiagnosed” team used the extracted symptoms from the medical notes in their workflow for providing recommendations. All information was de-identified and did not contain PHI.

We found it very inspiring that John, who attended the event, was willing to share his medical records with all of the participants in order to help us help him — and we hope the work we did will ultimately do so.

For more information about John’s case, visit the SV.AI site, which will provide all of the data so that any researcher can continue working on it. We would also like to thank SV.AI for organizing this event, and Mark Weiler and Lee Black from Amazon for helping us process the data through Amazon Textract and Amazon Comprehend Medical.