New ENCODE Paper Reveals Remarkable Chromatin Diversity at Regulatory Elements


Today marks a major milestone for the ENCODE consortium! More than 30 papers will be published today in Nature, Genome Biology, and Genome Research from teams of scientists working on various facets of the project.


One of those, a publication in Genome Research, reports a surprising level of heterogeneity among patterns of chromatin modifications as well as nucleosome positioning around regulatory elements such as transcription factor binding sites in the human genome. In the past, these genomic elements have been studied primarily by averaging patterns of chromatin marks across populations of sites, leading to the perception that the patterns were much more uniform. Mapping and analysis of the nucleosome positioning sequence data were performed on the DNAnexus platform.


Lead author Anshul Kundaje was a postdoc for Serafim Batzoglou and Arend Sidow at Stanford University during the project reported in the paper. Now a research scientist at MIT, Kundaje says the work was an integral part of the ENCODE consortium’s efforts to elucidate functional elements in the human genome. The scientists looked at 119 human transcription factors and regulatory proteins to better understand how nucleosomes are positioned and how histone modifications are made around binding sites. In the paper, the authors report that asymmetry of nucleosome positioning and histone modifications is the rule, rather than the exception.


Kundaje and his colleagues relied on ChIP-seq data for the 119 transcription factors in a variety of cell types, with corresponding data for histone modifications. They also generated similar data for nucleosome positioning. To improve accuracy, the team sequenced extremely deeply, ultimately generating some 5 billion reads on the SOLiD sequencing platform. “The data sets were incredibly massive,” Kundaje says. “Processing these data sets locally was quite a challenge.” The group turned to DNAnexus, uploading their sequence files to the cloud and preprocessing the data with the company’s probabilistic mapping tool. “DNAnexus made that process incredibly simple,” he adds.


Figure 1: The mapping of the 5 billion reads was performed using the DNAnexus mapper.

Using a new tool they developed, the Clustered AGgregation Tool (CAGT) for pattern discovery, the scientists found that nucleosome positioning and histone modifications at transcription factor binding sites are far more diverse than was previously thought. Rather than averaging across the regions as most studies have done, the new clustering tool was able to analyze the differences in magnitude, shape, and orientation of the many patterns identified.
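To see why plain averaging can hide asymmetry, consider the toy sketch below. It is not the CAGT algorithm and does not use the paper's data; the profiles, function names, and the simple "flip so the stronger half is on the right" rule are all assumptions made for illustration. It only demonstrates the orientation part of the idea: mirrored asymmetric patterns cancel into a symmetric average unless they are oriented first.

```python
# Toy illustration only: NOT the CAGT implementation.
# All names and numbers below are hypothetical.

def mean_profile(profiles):
    """Position-wise average across equal-length signal profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def orient(profile):
    """Return the profile flipped, if needed, so its stronger half is on the right."""
    half = len(profile) // 2
    left, right = sum(profile[:half]), sum(profile[-half:])
    return profile[::-1] if left > right else list(profile)

# Synthetic chromatin signal in five bins around six binding sites
# (the binding site itself sits in the middle bin).
right_heavy = [0.0, 1.0, 2.0, 5.0, 8.0]
left_heavy = right_heavy[::-1]
sites = [right_heavy, left_heavy] * 3

# Averaging without orientation: the two mirrored patterns cancel out.
naive = mean_profile(sites)

# Averaging after orienting each site: the asymmetry is preserved.
oriented = mean_profile([orient(p) for p in sites])

print("naive average:   ", naive)
print("oriented average:", oriented)
```

In this sketch the naive average is symmetric around the binding site even though every individual site is strongly asymmetric, which is the kind of artifact that grouping sites by shape and orientation is meant to avoid.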


“What we found is that the results you get from the clustering approach are dramatically different from what you get by simply averaging across all types,” Kundaje says. “We found a large diversity of patterns of histone modifications as well as nucleosome positioning around almost every transcription factor binding site.”


Even the well-known and remarkably well-studied transcription factor CTCF, long established as an insulator, was found to have surrounding chromatin patterns pointing to other functions throughout the genome.


Figure 2: Analysis using CAGT reveals the surprising diversity of patterns of an active chromatin mark, H3K27ac, around the binding sites of the CTCF protein, which is well known for its repressive insulator role.


The authors used their clustering tool to group the patterns into some 25 distinct signatures “that completely capture the diversity of all the modifications across all binding sites in a variety of cell types,” Kundaje says. The method uses ‘metapatterns’ to explain that diversity, and that information can reveal the function of these elements in context. “By accounting for combinatorial relationships between various binding events and how they affect chromatin, this gives you a more complete biological sense of what a transcription factor is doing in a cell type,” he adds.


Kundaje is already following up on this study by looking at other species to see whether the heterogeneity of modification patterns holds true in other organisms. He continues to use DNAnexus for analysis of sequencing data, especially in read mapping, quality control, and genome browsing, he says.


Using DNAnexus for the team’s ENCODE study “made the process significantly easier,” Kundaje adds, noting that the cloud provider’s direct integration of the genome browser was particularly helpful. DNAnexus allowed Kundaje and his colleagues to go from data to visualization with minimal processing steps in between, he says. “It frees up your time to focus on the more interesting work.”


For a glimpse of some of Kundaje’s data, DNAnexus has made the 20 samples available in its Public Data folder, called Encode. Click here to sign up for a free account.


Check out the Kundaje et al. paper “Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements.”


Seeing The Trees In The Forest

One of the biggest challenges in identifying genomic variation is finding the variants that have a real and measurable impact and help explain, for example, a disease or drug response under investigation. Weeding through more than 5 million variants associated with the human genome is a huge effort that requires significant computational infrastructure and staff time to manually validate and correlate the biological findings. To expedite this process and free up more time for focusing on relevant data, the variants must be narrowed down to a manageable number – ideally fewer than a few hundred.

We have just released a number of new features that will help solve this challenge by providing:

  1. Smart variation results filtering
  2. Linkouts to public and commercial data sources with gene to disease information

With this new functionality, you can – with a few simple queries – home in on the most relevant variants, whether they are associated with a specific gene, a coding region, a specific chromosome, or annotations that fulfill a specific set of characteristics. The result is quicker insight into affected processes that directly translates into faster hypothesis generation and decision making.

More Specifically…

To help you rapidly drill down on biologically interesting and relevant results, we have created a flexible query tool for filtering your variation analysis results within the DNAnexus Genome Browser. With just a few clicks, you can apply any number of filters to a results table, yielding a set of variant calls that allow easy navigation through the browser and further investigation.

In this release, we have added 13 distinct filters, including chromosome, variant type, gene/transcript name, zygosity, and location relative to gene/transcript. These filters are currently available for the DNAnexus Nucleotide-Level Variation (see screenshot below) and Population Allele Frequency analysis results. We are also working towards making them available for any data type, including RNA-seq and ChIP-seq data. All of the filtered results can be exported from DNAnexus for further analysis in other tools, such as Excel or statistical packages.
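The idea of stacking filters on a results table and exporting what is left can be sketched in a few lines of Python. This is only an illustration of the concept, not the DNAnexus query tool: the records, the field names, and the apply_filters helper are hypothetical and do not reflect the platform's actual schema or API.

```python
import csv
import io

# Hypothetical variant calls; field names are illustrative only.
variants = [
    {"chrom": "chr17", "type": "SNP", "gene": "BRCA1", "zygosity": "het", "location": "coding"},
    {"chrom": "chr17", "type": "indel", "gene": "BRCA1", "zygosity": "hom", "location": "intron"},
    {"chrom": "chr13", "type": "SNP", "gene": "BRCA2", "zygosity": "het", "location": "coding"},
    {"chrom": "chr17", "type": "SNP", "gene": "TP53", "zygosity": "hom", "location": "utr"},
]

def apply_filters(rows, **criteria):
    """Keep rows whose fields match every criterion (filters combine with AND)."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]

# Narrow the table with three filters at once.
hits = apply_filters(variants, chrom="chr17", type="SNP", location="coding")

# Write the filtered rows as CSV text, e.g. for opening in a spreadsheet.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(variants[0].keys()))
writer.writeheader()
writer.writerows(hits)
print(buffer.getvalue())
```

Each additional criterion can only shrink the result set, which is what makes it practical to go from millions of calls to a short list that can be reviewed by hand.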

Understanding And Validating Variant To Gene To Disease Results

To help you understand a prioritized list of variants, as well as the genes and processes they affect, we have added the ability to link out to third-party data sources, both public and commercial, that contain relevant gene-to-disease knowledge. These linkouts let you study how identified variations in DNA affect the response to diseases, bacteria, viruses, toxins and chemicals, including drugs and other therapies.

It’s All About The Data

DNAnexus specializes in addressing the data storage, management and analysis challenges inherent in next-generation sequencing. We believe that by leveraging the cloud and remaining data-source and platform agnostic, we can provide the best possible support for anyone using these data in their work. We also believe that your input on which data are accessible through DNAnexus is critical, and because our platform is flexible, we can easily integrate with many of the data sources you would like to access or need for your research.

DNAnexus currently supports direct linkouts to 12 public and commercial data sources: AmiGO, BIOBASE, COSMIC, dbSNP, Entrez Gene, GeneCards®, IPA®, KEGG, NextBio, OMIM, PharmGKB, and PubMed. For commercial data sources, we can provide integrated access for users who have licenses to access these data.

Please let us know by email if there are specific data that you would like to access via DNAnexus.

Take Me To The Data

To access these data sources we have added the new Gene Info pages (see the BRCA1 Gene Info page as an example below), which provide a gene overview and a list of all the data sources accessible. Gene Info pages are meant to give you a preview of the gene, with linkouts to additional information.

Gene Info pages are accessible through hyperlinked gene names within the DNAnexus Genome Browser and analysis results tables, as shown here.

We now support 22 reference genomes; the latest additions are the Staphylococcus genome S. epidermidis ATCC 12228 and the macaque genome M. mulatta.

Tell Us What You Think

Much of the new functionality that makes its way into the DNAnexus platform is the result of requests by our many active users. We cannot emphasize enough how much we value user feedback; it is a critical component of our product development and feature prioritization process.

To simplify the process of providing feedback, we have added feedback links to both the filterable results tables and the Gene Info pages. You are also welcome to email us with any feature requests or questions you may have. We look forward to hearing from you and keeping you posted on the many new features we are working on and will be releasing in the coming months.