Examining Variation from Wet-Lab Protocol Choices in Microbiome Data through the Mosaic Standards Challenge

Sam Westreich

The Need for the Mosaic Standards Challenge

Study of the human gut microbiome – the collection of bacteria that dwell in the lower intestinal tract of every person – is a challenging task.  Given the sheer number of bacteria present, along with the diversity of species among them, building an accurate profile of this environment’s microbial composition requires huge amounts of sequencing data.

Because of the complexity of this environment, it’s important to control all sources of potential variation due to experimental design.  Many researchers focus on making sure that they use a consistent bioinformatics pipeline, but this isn’t the only source of methods-based variation – almost every choice in the experiment, from the extraction kit used to the sequencing method chosen, has the potential to skew the results of a microbiome examination.

To examine these sources of variation and better determine the magnitude of their effects, DNAnexus partnered with the Janssen Human Microbiome Institute (JHMI), the National Institute of Standards and Technology (NIST), and the BioCollective to launch the Standards challenge.  We offered to provide a sample kit to any participant who joined the challenge.  Each participant agreed to extract and sequence these provided samples, and to provide Mosaic with the FASTQ files and the details of their processing, extraction, and sequencing protocols.

The Standards Challenge Sample Kit

On the Mosaic platform, we provide information about the Standards challenge, as well as a link to join the challenge.  Once a participant joins the challenge, they can place a free online “order” for a kit.  Each participant is allowed up to 3 kits, so they can compare and contrast their results using different wet-lab protocols.

Each kit contains seven samples.  Five are fecal samples, provided by the BioCollective, with each sample derived from a different human donor.  The kit also contains two purified DNA samples, provided by NIST, which contain DNA from known bacteria at predetermined relative concentrations.  The BioCollective and NIST constructed 700 of these kits, which are shipped for free to any participant around the globe.

Gathering Protocol Metadata

Metadata Submission Form

In order to properly determine what steps in the wet-lab sample processing, amplifying, and sequencing protocols have the greatest influence on variation in the sample results, we worked with several researchers and contract research organizations (CROs) to create a list of 99 questions about the protocols.

Answering 99 protocol questions for each sample is a lot of work, so we added several modifications and caveats.  First, in order to submit results, only two questions must be answered – whether the submitted sample is paired-end (so we know how many input files to expect per sample), and whether the sample was analyzed using 16S rRNA profiling or metagenomic sequencing (so samples can be properly segregated).  If the participant doesn’t know the answer to a specific question, they are free to skip it without invalidating their submission.

However, we also want to incentivize participants to complete enough metadata questions to provide useful information to analyze.  All responses that are submitted to us are anonymized, allowing commercial groups to submit without fear of potentially tarnishing their brand.  We selected a subset of the protocol questions (29 questions) that we marked as “preferred”.  If a participant answered all of the preferred questions for a sample, we would reveal, to that participant only, the anonymized ID of that particular sample.  This encouraged participants to submit metadata answers for all preferred questions so that they could see where their individual samples fell in comparison to others.

Analyzing the Results with a Consistent Pipeline

To analyze the submitted results, we worked with Dan Knights and Gabriel Al-Ghalith at the University of Minnesota, using their SHI7 and BURST tools to perform basic quality control and species-level profiling of the submitted FASTQ data.  Because we want to examine variation due to differences in wet-lab protocol, we made sure that the bioinformatics pipeline was identical for all samples.  This pipeline performed the following steps:

  1. Ran SHI7 (v0.9.9) to provide quality control and filtering on input files
  2. Removed human-aligned reads using Bowtie2
  3. Aligned against NCBI’s Representative Genomes collection (synced to RefSeq v82) using BURST12 v0.99.7
  4. Anonymized file headers and created a tarball of intermediate, anonymized results
  5. Used QIIME2 to convert files to BIOM format, merge together all sample files for each of the 7 samples, and calculate Euclidean beta diversity and principal coordinates
  6. Merged the principal coordinates data with the collected protocol metadata to create an interactive figure using Emperor
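The Euclidean beta diversity and principal-coordinates computation in step 5 can be sketched in a few lines of numpy.  This is only an illustration of what QIIME2 computes under the hood – the real pipeline operated on merged BIOM files, and the abundance table below is made up:

```python
import numpy as np

def pcoa(dist):
    """Classical principal coordinates analysis (metric MDS) on a
    square distance matrix; returns one row of coordinates per sample."""
    n = dist.shape[0]
    # Double-center the squared distances: B = -1/2 * J @ D^2 @ J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    # Eigendecompose and keep the positive-eigenvalue axes, largest first
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10
    return vecs[:, keep] * np.sqrt(vals[keep])

# Hypothetical taxon-count table: rows = submissions, columns = taxa
counts = np.array([[10.0, 0, 5], [9, 1, 4], [0, 12, 2], [1, 11, 3]])
# Euclidean beta diversity: pairwise distance between every pair of submissions
dist = np.linalg.norm(counts[:, None, :] - counts[None, :, :], axis=-1)
coords = pcoa(dist)  # principal coordinates, ready for an Emperor-style plot
```

Because the input distances are Euclidean, the principal coordinates reproduce them exactly; plotting the first two or three columns of `coords` gives the kind of PCoA view shown in the Emperor figures below.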

The intermediate tarballs, each containing the raw, anonymized FASTQ reads and the BIOM results files for each sample within that tarball’s submission, are made publicly available in the May 2019 data freeze workspace on Mosaic.  The final Emperor plot may be viewed in the Results Exploration tab of the Mosaic Standards challenge page.

Exploring the Initial Standards Challenge Results

Examining the interactive Emperor plot reveals that, while the samples do segregate on the PCoA plot based on their sample ID, there is distinct variation seen among responses for each of the seven samples distributed in the sample kit.

Figure 1
Figure 1b

Figure 1: Examination of all WGS samples (left) and all 16S samples (right), colored by sample number within the kit.  Note that the yellow and cyan points correspond to the purified DNA samples, and are more diversely distributed than the five fecal samples.

The majority of variation seen among different samples on the initial Emperor plots appears to be due to differences between the results of the fecal samples (left, outside of circle) and the results of the purified DNA samples (right, inside yellow and cyan circles) supplied in each kit.  When we remove these DNA samples from the graph and focus solely on the fecal samples, we see more distinct clusters for each of the five fecal samples.  We speculate that the DNA samples are more spread out because they contain substantially fewer distinct taxa, which allows noise to play an outsized role and exacerbates the effect of individual misidentified taxa.

Figure 2 Figure 2b

Figure 2: Examination of the five fecal samples, looking at WGS results (left) and 16S results (right).  Color indicates the number of the sample, of the five fecal samples distributed in the kit. 

Interestingly, all beta diversity patterns show prominent clustering that is arranged in a linear pattern by sample. While each of the 5 stool samples, as well as the purified DNA samples, can be graphically distinguished, it is notable that the stool samples using WGS sequencing have the least jitter and form reasonably straight lines. We speculate this is due to the more limited reference database used for WGS profiling constraining variability in the references chosen, while the larger 16S database, in tandem with small amplicons, amplifies the effects of subtle sequence variations in influencing the reference taxon chosen. With 16S sequencing, there is also more opportunity for various biases, including different primers, PCR amplification strategies, and amplicon length to influence the outcome than with shotgun metagenomics, which is expected to produce more uniformly random coverage of genomes present.

The fact that PCoA essentially assigns a separate axis to each sample (especially in the WGS case) may be a promising sign, as it appears each sample is much more like itself than any other, allowing samples to be distinguished regardless of which lab produced them. However, the 16S samples do not share this quality to the same extent, as we see some evidence of certain samples appearing to impinge on the trajectories of others, depending on which lab produced them.

Further Analysis of the Available Data

16S rRNA Sequencing Data

For the 16S data, we used QIIME2 to further analyze the raw data, annotating against the QIIME2-provided “GreenGenes 99% OTUs full-length sequences” reference.

Examining the 16S data at the family level of annotation, we note that several microbial families appear unique to data from one or two particular individuals, allowing results files to be identified as having been derived from these particular samples.  For example, the Prevotellaceae family was most abundant in results derived from fecal sample 5 within the kits, and also present at a lower level in fecal sample number 4.
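This kind of family-level roll-up amounts to collapsing each feature’s GreenGenes lineage string to its `f__` rank and renormalizing.  A minimal sketch in Python – the lineage strings and counts here are invented for illustration, not taken from the actual QIIME2 output:

```python
from collections import defaultdict

def family_relative_abundance(feature_counts):
    """Collapse GreenGenes-style lineage strings to the family (f__) rank
    and return each family's relative abundance within the sample."""
    totals = defaultdict(float)
    for taxonomy, count in feature_counts.items():
        # Lineages look like "k__Bacteria;p__...;f__Prevotellaceae;g__..."
        family = next((rank.strip() for rank in taxonomy.split(";")
                       if rank.strip().startswith("f__")), "f__unassigned")
        totals[family] += count
    grand_total = sum(totals.values())
    return {fam: n / grand_total for fam, n in totals.items()}

# Invented read counts for one submission (lineages illustrative only)
sample = {
    "k__Bacteria;p__Bacteroidetes;o__Bacteroidales;f__Prevotellaceae;g__Prevotella": 60,
    "k__Bacteria;p__Bacteroidetes;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides": 30,
    "k__Bacteria;p__Firmicutes": 10,  # no family-level call
}
profile = family_relative_abundance(sample)  # f__Prevotellaceae -> 0.6
```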

Figure 3

Figure 3: A QIIME2 visualization of the 16S taxonomy results for each submitted sample, sorted by sample number and colored at the family level.  Samples 1-5 are fecal, with 6-7 appearing distinctly different on the right side. 

Figure 4

Figure 4: The legend for Figure 3, providing annotations at the family level.  A link is provided below to the QIIME2 visualization file, and the data can be explored on QIIME2’s website (view.qiime2.org).

Looking at the DNA samples, we observe that the Enterobacteriaceae family is the best distinguisher of whether a sample was from the A or B mock communities distributed as pre-extracted DNA.

We provide the QIIME2 visualization file here; this can be further explored, and sorted by any metadata question, by using the online viewer at http://view.qiime2.org.

Metagenomic Sequencing Data

We further analyzed the metagenomic sequencing data using Kraken to annotate the data against the standard Kraken database, which includes bacterial, archaeal, and viral genomes from RefSeq.  Kraken results were exported in MPA format so that they could be merged using MetaPhlAn2, which was also used for generating comparative heatmaps of the results.
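MPA format is simply one clade lineage and one count per line, which is what makes per-sample profiles easy to merge into a single table.  The actual merging used MetaPhlAn2’s utilities; this rough sketch only illustrates the idea, with hypothetical clade counts standing in for real Kraken output:

```python
def merge_mpa(profiles):
    """Merge per-sample MPA-style profiles (clade lineage -> read count)
    into one table: clade -> list of counts, one entry per sample."""
    clades = sorted(set().union(*(p.keys() for p in profiles)))
    return {clade: [p.get(clade, 0) for p in profiles] for clade in clades}

# Hypothetical parsed MPA profiles; real MPA lines look like
# "k__Bacteria|p__Firmicutes<TAB>120", one clade per line
sub_a = {"k__Bacteria|p__Firmicutes": 120, "k__Bacteria|p__Bacteroidetes": 300}
sub_b = {"k__Bacteria|p__Bacteroidetes": 280, "k__Bacteria|p__Proteobacteria": 15}
table = merge_mpa([sub_a, sub_b])
# table["k__Bacteria|p__Proteobacteria"] -> [0, 15] (absent from sub_a)
```

A merged table like this, with one column per submission, is the input from which the comparative heatmaps below are drawn.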

Figure 5

Figure 5: A heatmap of all metagenomic sequencing data, generated by MetaPhlAn2.  Results largely cluster by sample number, with the fecal samples (1-5) on the left and the DNA samples (6-7) on the right.  

Examining a heatmap of all metagenome samples shows a clear distinction between the fecal samples (samples 1-5) and the purified DNA samples (samples 6-7).  Within the fecal samples, we see fairly clear separation of most of the samples, although some results for sample 1 are more similar to results from sample 4 than to other sample 1 results.  Samples 2, 3, and 5 all segregate completely.  We additionally observe one outlier in sample 1, on the left side of the heatmap.  This submission is highly divergent from all others, apparently because a truncated file, with far fewer reads than the other files, was provided.

We see additional results when we generate the heatmaps for each individual sample, determining similarity of microbiome profiles by top twenty most abundant organisms.

WGS sample heatmaps

Figure 6: Heatmaps produced using MetaPhlAn2 from Kraken results for metagenomic sequencing data for each individual sample.  The fecal samples (1-5) are more similar than the DNA samples (6-7), most likely due to the DNA samples containing material from fewer total species, and thus being more prone to distortion from mis-identification.

Here, we can see that there is very little variation overall within each of the fecal samples, with more variation observed in the purified DNA samples.  This is to be expected, as the purified DNA samples contain DNA from fewer organisms and are thus more prone to mis-identification.  Looking at these individual sample heatmaps, we can also easily distinguish which samples were replicates from the same participant using the same method; for example, submissions sub-cD69TaSgs3kiPLcjVTqzfkoQ, sub-E8p1NnXxrsEyXsS4dUZaPwvk, and sub-Fb2pkVAPdAR2383mVGL7xCU9 are all replicates.

Summary and Next Steps

This initial examination of the Mosaic Standards challenge data has yielded several interesting insights about the submissions collected so far from participants.  Results overall have shown a high degree of consistency when produced by the same laboratory using an identical method, but we observe significant variation between results provided by different groups.  By comparing samples based on taxonomic similarity, we are able to cluster results by their sample of origin, more easily in metagenomic samples than in 16S results.

These current examinations of the data have compared different results, but haven’t yet attempted to draw ties between protocol choices in metadata and variation in taxonomic results.  Future investigations will focus on examining which protocol choices are correlated with the greatest levels of variation, connecting the metadata with the taxonomic results of analyzing the submitted data.

The Mosaic Standards challenge is still open for submissions – it’s still possible to join the challenge, receive a free sample kit, and submit your results.  As additional responses roll in from participants, we will continue to provide regular data freezes and additional analysis and insights from this evolving and multidimensional dataset.  Sign up to receive a free sample kit here: https://www.platform.mosaicbiome.com/challenges/8

DNAnexus Detectives: Using Amazon Web Services to Help Solve a Medical Mystery

Jason Chin, Chiao-Feng Lin, and Arkarachai Fungtammasan

Our mission, and we chose to accept it, was to join more than 100 researchers and engineers to look for answers and create insights into a real patient’s mystery medical condition.

In this case, that patient was John, aka “Undiagnosed-1”, a 33-year-old Caucasian male suffering from undiagnosed disease(s) with gastrointestinal symptoms that started from birth. In his 20’s, John’s GI issues became more severe as he began to have daily lower abdominal pain characterized by burning and nausea. He developed chronic vomiting, sometimes as often as 5 times per day. Now 5’10” tall and weighing 109 lbs, he is easily fatigued due to his limited muscle mass and low weight.

Armed with more than 350 pages of PDFs containing scanned images of John’s medical records plus a range of genetic data — from Invitae’s testing panel to whole genome shotgun sequencing from multiple technologies (Illumina, Oxford Nanopore and PacBio) — could we generate ideas for diagnosis, new treatment options or symptom management? Perhaps we could interpret variants of unknown significance or identify off-label therapeutics through mutational homogeneity.

This was the challenge set to us as part of a three-day event in June organized by SV.AI, a non-profit community designed to bring together bright minds from AI, machine learning, and biology backgrounds to solve real-world problems. This was its third event. Last year, we helped apply Google’s DeepVariant to a new kind of sequencing data for a rare kidney cancer case.

Off to a flying start

At DNAnexus, our main mission is to help our customers to process large amounts of genomic data with cloud computing. It is straightforward for us to do the initial processing of the genomic data. In this case, our customers were the event’s community of genomic “hackers.” We decided to pre-process John’s genomic data, so that our fellow participants would not have to spend extra effort to go through variant calling.

Chai generated SNPs and structural variants before the event, and made the information available to everyone who might need it.

However, genomic data was only half of the picture. Clearly, the clinical information gleaned from John’s medical records would provide some clues. But how could we get through hundreds of pages of scanned images of his medical records (available under an MIT license to invited participating scientists)? There had to be a smarter way to process the records, one that would allow us, and others, to write scripts and programs to process them.

Luckily, there is: Amazon Textract and Amazon Comprehend Medical.

Developed by Amazon Web Services (AWS) using modern machine learning technology, Amazon Textract is a service that automatically extracts text and data from scanned documents — think OCR on steroids. While there are many OCR software applications on the market these days, Textract provides more; it detects a document’s layout and the key elements on the page, understands the data relationships in any embedded form or table, and extracts everything with its context intact. This makes it possible for a developer to write code to further process the output data in text or JSON format to extract important information more efficiently. It is important to note that at the time of this blog post, Amazon Textract is not a HIPAA eligible service.  We were able to use it in this case because the patient data being analyzed was de-identified.  Amazon Textract should not be used to process documents with protected health information until it has achieved HIPAA eligibility.  Please check here to determine if Amazon Textract is HIPAA eligible.

Another technology developed by AWS, Amazon Comprehend Medical, was used to process the output from Textract for John’s medical records. Amazon Comprehend Medical is a natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. Using Amazon Comprehend Medical, you can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records. Using it, we were able to extract medications, historical symptoms, and medical conditions from John’s doctors’ notes and testing/diagnostics reports.
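The glue between the two services is thin: Textract returns a JSON response whose `Blocks` list can be filtered down to `LINE` entries, and the resulting plain text is what gets handed to Comprehend Medical.  A simplified sketch, using a hand-built stand-in for a real Textract response (the field names match the service’s output, but the content is invented):

```python
def extract_lines(textract_response):
    """Collect the detected text lines, in order, from a Textract-style
    response: a dict with a "Blocks" list of typed block records."""
    return [block["Text"]
            for block in textract_response.get("Blocks", [])
            if block.get("BlockType") == "LINE"]

# Hand-built stand-in for a detect_document_text() response; a real one
# also carries geometry, confidence scores, and WORD-level relationships
response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Chief complaint: chronic nausea"},
        {"BlockType": "WORD", "Text": "Chief"},
        {"BlockType": "LINE", "Text": "Daily vomiting since his early 20s"},
    ]
}
lines = extract_lines(response)  # plain text, ready for Comprehend Medical
```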

With more structured patient information than merely a collection of medical records as images, combined with the variant calls generated from the genomics data, the participants of the hackathon were able to jump right into solving the medical and genomic puzzles for John to help him and relieve his symptoms.

The results

We were happy that a couple of teams from the hackathon were able to use both the variant call set and the processed medical data that we provided.

Genomics Info Word Cloud

A nice word cloud generated from John’s medical record and genomics information by the “Too Big to Fail John” team. (Origin: https://github.com/SVAI/Undiagnosed-1/tree/master/TooBigToFail)

  • The “Thrive” team used a similar approach to find potential variant candidates by identifying variants that are commonly seen with John’s medical conditions and predicted deleterious variant calls.

Thrive Variant Calling Schema

The schema of the Thrive team approach to analyzing both the medical records and variant call set.

  • The team Crucigrama extended the scope by including other public ‘omics data, such as metabolic profiles, and by applying NLP to public genomics data.

Crucigrama Problem Statement

Team Crucigrama’s problem statement to extend the traditional approach to finding new leads for Undiagnosed-1.

  • The “Beyond Undiagnosed” team also utilized the medical record in their workflow so they could quickly gather key symptoms and diagnoses, and provide future care recommendations based on their findings.

Beyond Undiagnosed Extracted Symptoms

The “Beyond Undiagnosed” team used the extracted symptoms from the medical notes in their workflow for providing recommendations. All information was de-identified and did not contain PHI.

We found it very inspiring that John, who attended the event, was willing to share his medical records with all of the participants in order to help us help him — and we hope the work we did will ultimately do so.

For more information about John’s case, visit the SV.AI site, which provides all of the data so that other researchers can continue working on it. We would also like to thank SV.AI for organizing this event, and Mark Weiler and Lee Black from Amazon for helping us process the data through Amazon Textract and Amazon Comprehend Medical.

DNAnexus Navigation and UI Changes

To keep the DNAnexus platform easy enough for anyone to use and powerful enough for expert users, our team has made some layout and user interface (UI) changes. While these updates are relatively minor, they make the platform look different than you might be used to, so we’ve outlined them below to help you find your way around and see what’s new.

New Look

The first thing you’ll probably notice is that the user interface looks cleaner, flatter, and more modern. The visual updates to the interface make it easier to look at and faster to find what you’re looking for. Primary actions will be on the right of the screen and be immediately noticeable. Certain areas of the site won’t have this updated look yet, but every part of the site is getting revamped in the next few months, so if it hasn’t changed yet, it will soon.


Projects UI

We have redesigned the Project list page with an easily filterable list of all your projects. A new “pin” feature allows you to mark your favorite projects so they remain at the top of the list!

Project List

Projects now display a line of summary text in the main list. Longer text can be added in the Descriptions section of the Info panel.

Reference Data List

Info Panel

We have added a new “info” panel which allows you to quickly inspect any project when you select its row. The info panel can be opened by clicking the “i” icon in the upper right. This shows information (metadata, project settings, project size, etc.) which is also available in the Project Settings. Now you can access this information directly from the Project list page. The info panel also lets you easily copy the project ID.

Pin Project

The context (three-dots) menu in each row gives you shortcuts to leave or delete a project (depending on your access level), share it, pin it, and view project settings.

To view the contents of a project, click the project name and you will enter the Data Manager section.

Data Manager: Manage Tab 

The Manage section has many enhancements. Next to the project name there is now a menu for quick access to common tasks such as sharing a project. Tasks such as leaving or deleting a project (formerly found in Settings) are also in this menu.

Data Manager

Project navigation has been enhanced with a collapsible tree as well as fully functional breadcrumbs.

Project Folder Tree

All action buttons have been consolidated to the right side of the screen (New Folder, New Workflow, Upload Data, Add Data, Copy Data from Project, Start Analysis).

Start Analysis Add Button

The filter bar has been redesigned and now defaults to searching within a project. 

Project Folders

Data Manager: Manage Tab: New “info” panel

We have added a new “info” panel which allows you to quickly inspect any file. The info panel can be opened by clicking the “i” icon in the upper right. You can then select any item, or multiple items, to display their properties. This shows information previously found in the info pop up window.

You can easily copy a file or path ID from this side panel by clicking the copy icon. 

Project Info Panel

Tables are now paginated.

New folders are now created and shown at the top of the list. After you enter a name, the folder will move to the appropriate place in the sort order.

Project Folder Rename

Projects can be renamed in the Settings tab or in the Info panel on the main project list page.

Data Uploading

The data uploading dialog has been split into three discrete functions.

Data Uploader

The ability to apply tags and properties in the dialog has been removed. Tags and properties can easily be added in the new Info panel. Select multiple items to apply the same tag or property to multiple objects at once.

Note: Settings & Visualize tabs are the same with minor visual updates.

Featured Projects

The Featured projects section has been removed and the projects in Resources have been updated to have useful reference files and demo data.

Featured Projects Section

Data Selection Dialog

The Data Selection dialog appears when you copy data, select projects or paths during app runs, and select inputs for workflows. It now allows you to filter, add/remove columns, and view pinned items.

Platform Data Selection

Within a project, filters make it easier to find files.

Platform Folder Selection

Start Analysis 

The Start Analysis dialog has been enhanced to improve usability. It has the familiar filtering mechanism to easily locate an app, applet, workflow, or global workflow. You can now see the category and the author of the analysis tool. By selecting a row and opening the “i” Info panel, you can still view the inputs and outputs.

Starting Analysis

The tool version can be selected in the dropdown:

Allele Frequency Calculator

Open the tool’s details to see the full information provided by the tool developer.

Tool Details Button

Tool Runner

The Tool Runner has been enhanced with a graphical representation of the analysis process. Each app or workflow has three areas that you can configure: Settings, Analysis Inputs, and Stage Settings.

Settings includes execution name, project location, output folder and optional advanced features.

Analysis Inputs is where you select the appropriate inputs and can toggle to batch mode. You can also now view all inputs in one location.

Stage Settings contains information about each stage of a workflow, including app version, instance type and output folder. You can change these as desired.

Tool Runner

Data Manager: Monitor Tab

The Monitor Tab also has a fresh look with updated filter bar UI and action buttons consolidated to the right side. 

Monitor Tab

On the monitor details page, several actions have been moved or temporarily removed.

  • View Info is not currently visible (coming in our next release)
  • View Input has been removed, as the inputs and outputs are all shown on the details page
  • Save as New Workflow has been removed
  • The Monitor tab does not yet indicate when a job is running (coming in the next release)
  • Tags are not yet shown in the Monitor table (coming in the next release as their own column)

You can copy an execution ID to the clipboard by clicking the icon next to the ID.

Logs have a new Download button:

Logs Download Feature