New ENCODE Paper Reveals Remarkable Chromatin Diversity at Regulatory Elements

Today marks a major milestone for the ENCODE consortium! More than 30 papers will be published today in Nature, Genome Biology, and Genome Research from teams of scientists working on various facets of the project.

One of those, a publication in Genome Research, reports on a surprising level of heterogeneity among patterns of chromatin modifications as well as nucleosome positioning around regulatory elements such as transcription factor binding sites in the human genome. In the past, these genomic elements have been studied primarily by averaging patterns of chromatin marks across populations of sites, leading to the perception that patterns were much more uniform. Mapping and analysis of the nucleosome positioning sequence data were performed on the DNAnexus platform.

Lead author Anshul Kundaje was a postdoc for Serafim Batzoglou and Arend Sidow at Stanford University during the project reported in the paper. Now a research scientist at MIT, Kundaje says the work was an integral part of the ENCODE consortium’s efforts to elucidate functional elements in the human genome. The scientists looked at 119 human transcription factors and regulatory proteins to better understand how nucleosomes are positioned and how histone modifications are made around binding sites. In the paper, the authors report that asymmetry of nucleosome positioning and histone modifications is the rule, rather than the exception.

Kundaje and his colleagues relied on ChIP-seq data for the 119 transcription factors in a variety of cell types, with corresponding data for histone modifications. They also generated similar data for nucleosome positioning. To improve accuracy, the team sequenced extremely deeply, ultimately generating some 5 billion reads on the SOLiD sequencing platform. “The data sets were incredibly massive,” Kundaje says. “Processing these data sets locally was quite a challenge.” The group turned to DNAnexus, uploading their sequence files to the cloud and preprocessing the data with the company’s probabilistic mapping tool. “DNAnexus made that process incredibly simple,” he adds.

Figure 1: The mapping of the 5 billion reads was performed using the DNAnexus mapper.

Using a new tool they developed, the Clustered AGgregation Tool (CAGT) for pattern discovery, the scientists found that nucleosome positioning and histone modification at transcription factor binding sites are far more diverse than was previously thought. Rather than averaging across the regions as most studies have done, the team used the new clustering tool to analyze differences in the magnitude, shape, and orientation of the many patterns identified.
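To see why averaging can hide this kind of diversity, here is a toy sketch in Python. It is not CAGT and does not use the ENCODE data: the signal profiles are synthetic, the helper names (make_profile, mean_profile, kmeans, asymmetry) are made up for illustration, and plain k-means stands in for whatever clustering the paper actually uses. It assumes a hypothetical population in which half the sites carry signal on one side of the binding site and half on the other.

```python
# Illustrative sketch only: synthetic profiles and plain k-means, not the CAGT algorithm.
import random

def make_profile(peak_pos, width=21, noise=0.05, rng=random):
    """Toy signal across `width` positions around a binding site (center = width // 2)."""
    return [max(0.0, 1.0 - abs(i - peak_pos) / 5.0) + rng.uniform(-noise, noise)
            for i in range(width)]

def mean_profile(profiles):
    """Position-wise average of equal-length profiles (the usual aggregate plot)."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [profiles[0]]
    while len(centers) < k:
        centers.append(max(profiles, key=lambda p: min(sq_dist(p, c) for c in centers)))
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean_profile(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers, groups

def asymmetry(profile):
    """Right-flank signal minus left-flank signal; a value near 0 looks symmetric."""
    mid = len(profile) // 2
    return sum(profile[mid + 1:]) - sum(profile[:mid])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical population: 50 sites with signal on the left, 50 with signal on the right.
sites = ([make_profile(5, rng=rng) for _ in range(50)]
         + [make_profile(15, rng=rng) for _ in range(50)])

aggregate = mean_profile(sites)
centers, groups = kmeans(sites, k=2)

print("aggregate asymmetry:", round(asymmetry(aggregate), 2))
for center, group in zip(centers, groups):
    print("cluster of", len(group), "sites, asymmetry:", round(asymmetry(center), 2))
```

In this toy setup the aggregate profile comes out nearly symmetric even though no individual site is, while the two cluster centers each recover one of the lopsided shapes. The sketch only separates profiles by shape; handling magnitude and orientation the way the paper describes would take more than this.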

“What we found is that the results you get from the clustering approach are dramatically different from what you get by simply averaging across all types,” Kundaje says. “We found a large diversity of patterns of histone modifications as well as nucleosome positioning around almost every transcription factor binding site.”

Even the well-known and remarkably well studied transcription factor CTCF, long established as an insulator, was found to have surrounding chromatin patterns pointing to other functions throughout the genome.

Figure 2: Analysis using CAGT reveals the surprising diversity of patterns of the active chromatin mark H3K27ac around the binding sites of the CTCF protein, which is well known for its repressive insulator role.

The authors used their clustering tool to group the patterns into some 25 distinct signatures “that completely capture the diversity of all the modifications across all binding sites in a variety of cell types,” Kundaje says. The method uses ‘metapatterns’ to explain that diversity, and that information can reveal the function of these elements in context. “By accounting for combinatorial relationships between various binding events and how they affect chromatin, this gives you a more complete biological sense of what a transcription factor is doing in a cell type,” he adds.

Kundaje is already following up on this study by looking at other species to see whether the heterogeneity of modification patterns holds true in other organisms. He continues to use DNAnexus for analysis of sequencing data, especially in read mapping, quality control, and genome browsing, he says.

Using DNAnexus for the team’s ENCODE study “made the process significantly easier,” Kundaje adds, noting that the cloud provider’s direct integration of the genome browser was particularly helpful. DNAnexus allowed Kundaje and his colleagues to go from data to visualization with minimal processing steps in between, he says. “It frees up your time to focus on the more interesting work.”

For a glimpse of some of Kundaje’s data, DNAnexus has made the 20 samples available in the Public Data folder called Encode. Sign up for a free account.

Check out the Kundaje et al. paper “Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements.”

Congratulations to Our Newest iPad Winners!

We’re delighted to see that more and more people are signing up to keep updated on the status of the upcoming launch of our new platform. As promised, each month we have drawn a name from that growing list to be the lucky winner of a new iPad.

We have two new winners to announce:

For July, the lucky person was Marie Fahey, a bioinformatics analyst with Asuragen. And for June, the winner was Ravi Madduri, a fellow in the Computation Institute at the University of Chicago. Congratulations to both!

If you haven’t already done so, don’t forget to sign up to be the first to know when we unveil the new platform.

At Rhode Island NGx Conference, Informatics Is Clearly on the Rise

We are back from Cambridge Healthtech Institute’s NGx Next-Generation Sequencing Data Analysis conference last week and more inspired than ever. The central theme from the 2012 meeting in Rhode Island? Judging from the talks and from conversations with attendees, the focus was on next-generation sequencing applications and their related challenges. Or more simply put, how the heck are we going to manage all this data? It appears this is the dawning of the cloud.


The plenary sessions on the first day really set the stage for the meeting and, it seemed to me, put a spotlight on the informatics and analysis issues that scientists are facing right now. Dick McCombie from Cold Spring Harbor Laboratory noted that his sequencing center is now producing about a terabase of genome sequence every month. George Weinstock, associate director of the Genome Institute at Washington University in St. Louis, said that if data output trends continue, in three years the Genome Institute will be churning out 100,000 human genomes per year. He also referred to two large data centers that were built across the street from the institute in the last several years to keep pace with the sequence being generated. Going forward, he said, continuing to add such centers at remarkably high cost will no longer be a solution.


Here at DNAnexus, we agree wholeheartedly with that sentiment. Comments like this support our core value of moving the data storage and informatics headaches into the cloud so that biologists can focus on what they actually want to do. We think that whether users are from large genome centers or small institutions, there is significant benefit to putting data in the cloud so people do not have to worry about internal data management resources and challenges.


It was also interesting to hear how significantly the balance is shifting between biology and the computational side. According to Weinstock, some 40 to 50 percent of the institute staff are informatics or analysis people, a major increase from the norm just a few years ago. In another talk, Vanessa Hayes from the J. Craig Venter Institute said that for every one person they have generating data, they have 10 others engaged in managing and interpreting that data. It’s quite an eye-opener when you think back to the days when computational biology was a small add-on to the average wet lab.


Hearing about the increasing number of people dealing with data analysis resonates with us; after all, this is why we built the DNAnexus tool to enable collaborations. We think that a community-based approach is important for this field, so we came away from the NGx conference more convinced than ever that our cloud-based solution has a lot to offer people working with DNA sequence data. Feel free to check out our current offering now with a complimentary trial, or sign up to receive updates as we prepare to launch our new community-centered platform soon.