SOT: Still Early Days for Next-Gen Sequencing in Molecular Toxicology

The Society of Toxicology’s 51st annual meeting was held this week right in our back yard. Since I am a longtime member, I headed up to the Moscone Convention Center in San Francisco to check it out. The Annual Meeting and ToxExpo were packed, with almost 7,500 people and more than 350 exhibitors.

SOT isn’t like the sequencing-focused meetings I’ve been attending since I joined DNAnexus, but it’s actually home turf for my own research background in toxicogenomics. This year’s meeting sponsors included a number of pharmas and biotechs, from Novartis and Bristol-Myers Squibb to Amgen and Syngenta. Scientific themes at the conference ranged from environmental health to clinical toxicology to regulatory science and toxicogenomics. Next-gen sequencing is still in its infancy in the world of molecular toxicology, which remains dominated by microarray expression experiments. Very few posters showed applications of NGS data in toxicogenomics, and those that did tended to center on microRNAs. Still, many of the people I spoke with have recently started running sequencing studies with the aim of eventually retiring microarray-type experiments.

I found Lee Hood’s opening presentation particularly interesting because he focused on the need to combine data from various technology platforms and institutions all over the world. He talked about his P4 vision, of course — the idea that medicine going forward will have to be predictive, personalized, preventive, and participatory. He also included great gems about fostering a cross-disciplinary culture, mentioning genome sequencing of families, the human proteome, and mining genomic data together with phenotypic and clinical data.

Lee Hood. Photo Copyright Chuck Fazio

Another exciting and well-received talk came from Joe DeRisi at the University of California, San Francisco. He presented work analyzing hundreds of honey bee samples with microarrays combined with DNA and RNA sequencing. Using an internally developed de novo assembler called PRICE (short for Paired-Read Iterative Contig Extension; freely available on his website), his team identified a number of different organisms in the sequence data from the honey bee samples, including various viruses, phorids, and parasites. It is not yet clear what is causing the honey bee population decline; multiple factors appear to contribute to the phenomenon. It is great to see that DeRisi and team will continue working in this area.

Last but not least, Scott Auerbach from the National Toxicology Program announced the release of the previously commercial toxicogenomics database DrugMatrix to the public for free (announced earlier this year, but now officially made public). With this release, DrugMatrix is now the largest scientific and freely available toxicogenomic reference database and informatics system. The data included is based on rat organ toxicogenomic profiles for 638 compounds; DrugMatrix allows an investigator to formulate a comprehensive picture of a compound’s potential for toxicity with greater efficiency than traditional methods. All of the molecular data stems from microarray experiments, but Auerbach and team are now investigating what it will take to move from microarrays to RNA-seq experiments and how to integrate the different types of data. They are currently performing a pilot on a subset of compounds with the same RNA used for the microarray experiments. Their challenge, as he sees it, lies in the interpretation and validation of the newly generated RNA-seq data: what qualifies one platform as superior to the other? Since they are interested in the biology and in generating drug classifiers, one way of looking at it is to assess which platform is the basis for better classifiers based on sensitivity and specificity thresholds. It will be interesting to see whether the RNA-seq data-based classifiers will be comparable or superior to microarray classifiers.
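To make the platform comparison a little more concrete, here is a minimal Python sketch of scoring two classifiers by sensitivity and specificity against the same known labels. It is not the NTP team's method: the function names, the thresholds, and the toy labels are all made up for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(truth: Sequence[bool],
                            predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary toxic/non-toxic calls."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


def meets_thresholds(truth: Sequence[bool], predicted: Sequence[bool],
                     min_sensitivity: float, min_specificity: float) -> bool:
    """Check a classifier against hypothetical acceptance thresholds."""
    sens, spec = sensitivity_specificity(truth, predicted)
    return sens >= min_sensitivity and spec >= min_specificity


# Toy data, invented for illustration: True means "compound is toxic".
truth            = [True, True, True, True, False, False, False, False]
microarray_calls = [True, True, True, False, False, False, False, True]
rnaseq_calls     = [True, True, True, True, False, False, False, True]

print(sensitivity_specificity(truth, microarray_calls))  # (0.75, 0.75)
print(sensitivity_specificity(truth, rnaseq_calls))      # (1.0, 0.75)
```

With real data, each classifier would be evaluated on held-out compounds, and the thresholds would come from the application rather than from the analyst's convenience.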

Seeing The Trees In The Forest

One of the biggest challenges in identifying genomic variation is finding the variants that have a real and measurable impact and help explain, for example, a disease or drug response under investigation. Weeding through more than 5 million variants associated with the human genome is a huge effort that requires significant computational infrastructure, plus staff time to manually validate and correlate the biological findings in the data. To expedite this process and free up more time for focusing on relevant data, the variant list must be narrowed down to a manageable size, ideally fewer than a few hundred variants.

We have just released a number of new features that will help solve this challenge by providing:

  1. Smart variation results filtering
  2. Linkouts to public and commercial data sources with gene-to-disease information

With this new functionality, you can – with a few simple queries – home in on the most relevant variants, whether they are associated with a specific gene, a coding region, a specific chromosome, or annotations that fulfill a specific set of characteristics. The result is quicker insight into affected processes that directly translates into faster hypothesis generation and decision making.

More Specifically…

To help you rapidly drill down on biologically interesting and relevant results, we have created a flexible query tool for filtering your variation analysis results within the DNAnexus Genome Browser. With just a few clicks, you can apply any number of filters to a results table, yielding a set of variant calls that allow easy navigation through the browser and further investigation.

In this release, we have added 13 distinct filters, including chromosome, variant type, gene/transcript name, zygosity, and location relative to gene/transcript. These filters are currently available for the results of the DNAnexus Nucleotide-Level Variation (see screenshot below) and Population Allele Frequency analyses. We are also working towards making them available for any data type, including RNA-seq and ChIP-seq data. All of the filtered results can be exported out of DNAnexus for further analysis in other tools, such as Excel or statistical packages.
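For readers who want to apply the same idea to exported results, the following Python sketch shows how stacked filters narrow a variant table. It is an illustration only and does not reflect how the DNAnexus filters are implemented; the `Variant` fields, the helper names, and the example rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Variant:
    # Hypothetical column names for an exported variant table.
    chromosome: str
    position: int
    variant_type: str   # e.g. "SNP", "insertion", "deletion"
    gene: str
    zygosity: str       # e.g. "heterozygous", "homozygous"
    location: str       # e.g. "coding", "intronic", "intergenic"


VariantFilter = Callable[[Variant], bool]


def by_field(field: str, allowed: Iterable[str]) -> VariantFilter:
    """Build a filter that keeps variants whose field is in `allowed`."""
    allowed_set = set(allowed)
    return lambda variant: getattr(variant, field) in allowed_set


def apply_filters(variants: Iterable[Variant],
                  filters: Iterable[VariantFilter]) -> List[Variant]:
    """Keep only the variants that pass every filter (logical AND)."""
    active = list(filters)
    return [v for v in variants if all(f(v) for f in active)]


# Made-up rows for illustration; positions are not real calls.
variants = [
    Variant("chr17", 1000, "SNP", "BRCA1", "heterozygous", "coding"),
    Variant("chr17", 2000, "deletion", "BRCA1", "homozygous", "intronic"),
    Variant("chr13", 3000, "SNP", "BRCA2", "heterozygous", "coding"),
]

shortlist = apply_filters(variants, [
    by_field("chromosome", ["chr17"]),
    by_field("location", ["coding"]),
    by_field("zygosity", ["heterozygous"]),
])
print(shortlist)
```

Each added filter can only shrink the list, which is what makes a few well-chosen filters enough to get from millions of calls to a reviewable shortlist.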

Understanding And Validating Variant To Gene To Disease Results

To help you understand a prioritized list of variants, as well as the genes and processes they affect, we have added the ability to link out to third-party data sources, both public and commercial, that contain relevant gene-to-disease knowledge. These linkouts let you study how identified variations in DNA affect the response to diseases, bacteria, viruses, toxins, and chemicals, including drugs and other therapies.

It’s All About The Data

DNAnexus specializes in addressing the data storage, management, and analysis challenges inherent in next-generation sequencing. We believe that by leveraging the cloud and remaining agnostic to data source and platform, we can provide the best possible support for anyone using these data in their work. We also believe that your input on which data are accessible through DNAnexus is critical. Because our platform is flexible, we can easily integrate with many of the data sources you would like to access or need for your research.

DNAnexus currently supports direct linkouts to 12 public and commercial data sources: AmiGO, BIOBASE, COSMIC, dbSNP, Entrez Gene, GeneCards®, IPA®, KEGG, NextBio, OMIM, PharmGKB, and PubMed. For commercial data sources, we can provide integrated access for users who have licenses to access these data.

Please let us know if there are specific data that you would like to access via DNAnexus by emailing us at

Take Me To The Data

To access these data sources we have added the new Gene Info pages (see the BRCA1 Gene Info page as an example below), which provide a gene overview and a list of all the data sources accessible. Gene Info pages are meant to give you a preview of the gene, with linkouts to additional information.

Gene Info pages are accessible through hyperlinked gene names within the DNAnexus Genome Browser and analysis results tables, as shown here.
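As a rough picture of what a gene linkout amounts to, the Python sketch below builds search URLs for a gene symbol against three of the public sources listed above. This is not how the Gene Info pages are implemented. The URL templates are assumptions based on those sites' public search pages and may change, and the function and dictionary names are hypothetical.

```python
from typing import Dict
from urllib.parse import quote

# Assumed public search URL patterns, not DNAnexus endpoints.
LINKOUT_TEMPLATES: Dict[str, str] = {
    "Entrez Gene": "https://www.ncbi.nlm.nih.gov/gene/?term={symbol}",
    "GeneCards": "https://www.genecards.org/cgi-bin/carddisp.pl?gene={symbol}",
    "PubMed": "https://pubmed.ncbi.nlm.nih.gov/?term={symbol}",
}


def gene_linkouts(symbol: str) -> Dict[str, str]:
    """Return a source-name -> URL mapping for one gene symbol."""
    encoded = quote(symbol, safe="")  # percent-encode anything unusual
    return {name: template.format(symbol=encoded)
            for name, template in LINKOUT_TEMPLATES.items()}


for source, url in gene_linkouts("BRCA1").items():
    print(f"{source}: {url}")
```

Keeping the templates in one table means adding a data source is a one-line change, which is one simple way a list of linkouts could be extended over time.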

We now support 22 reference genomes. The latest additions include the Staphylococcus genome S. epidermidis ATCC 12228 and the macaque genome M. mulatta.

Tell Us What You Think

Much of the new functionality that makes its way into the DNAnexus platform is the result of requests by our many active users. We cannot emphasize enough how much we value user feedback; it is a critical component of our product development and feature prioritization process.

To simplify the process of providing feedback, we have added feedback links to both the filterable results tables and the Gene Info pages. You are also welcome to email us at with any feature requests or questions you may have. We look forward to hearing from you and keeping you posted on the many new features we are working on and will be releasing in the coming months.