Collaborative Research Was the Big Winner at Bio-IT World Europe


Earlier this month I attended the 4th annual Bio-IT World Europe Conference & Expo in Vienna, where I found that the scientific community’s enthusiasm for high-performance, cloud-based computing is higher than ever. I was thrilled to see growing demand for resources that help scientists and bioinformaticians store, manage, and analyze their data, particularly in ways that facilitate collaboration among larger groups. Quite a bit of money seems to be spent in Europe on cloud-based and open source tools to support and advance genomics research. The money comes from research funds, but also from the commercial sector. That’s especially interesting given that Europe’s overall funding situation seems a bit shaky, and it is great to see that there is still enough funding on the research side.


Indeed, collaboration was a central theme at the conference this year. A keynote presentation from Yike Guo, a professor in computing science at Imperial College London, focused on the Innovative Medicines Initiative (IMI), a public-private partnership bringing together biopharmaceutical companies, hospitals, universities, and others to help bring safer, more effective medicines to patients. Guo oversees a project called eTRIKS, or the European Translational Information & Knowledge Management Services, which received €24 million from IMI to build a cloud-based platform to improve collaboration among IMI members, including big pharma companies and academic institutions. The effort is open source, and Guo’s team hopes to have a prototype ready for testing in a few months.


In another keynote session, Paul Flicek, principal investigator and head of the vertebrate genomics team at EMBL’s European Bioinformatics Institute, spoke about evaluating cloud-based computing as part of his work with Ensembl, the 1000 Genomes Project, and ENCODE. He termed it “interacting with the cloud through the lens of Ensembl.” Flicek made the important point that the ultimate goal isn’t to amass sequence data, such as aligned reads, variation calls, and genome browser views, but rather to extract knowledge from that data to improve our understanding of biology and disease. He uses cloud services from Amazon to take advantage of its entire infrastructure, to distribute the data, and to provide genome annotation for more than 50 species.


I was also really interested in a talk from Veit Ulishoefer, who presented an update on the Pistoia Alliance. This group was formed a few years ago by informatics experts at some of the leading pharmaceutical companies who wanted to share precompetitive information to streamline the drug discovery process at all of their companies. Today, the group is made up of pharma companies, publishers, and academic institutes, among others. Ulishoefer spoke about a recent competition called Sequence Squeeze, hosted by Pistoia, to find the best compression tool for sequence data. The winning entry, from James Bonfield, a researcher at the Wellcome Trust Sanger Institute, can be accessed through SourceForge.


It was great to see that so many of these collaborative projects were driven by pharma, which isn’t necessarily known for having a share-and-share-alike mentality. If even these highly competitive corporations can find ways to work together, that gives me great hope that such alliances will help usher in an improved understanding of diseases and more effective medicines. Here at DNAnexus, we strongly believe in collaboration; it is a major focus of ours and is well supported by the core capabilities of the cloud.


At Beyond the Genome Conference, Lessons on Data Analysis and Clinical Studies


A few of us from DNAnexus had the privilege of attending Beyond the Genome 2012, a conference organized by BioMed Central and held at Harvard Medical School. The meeting, now in its third year, continued its trend of attracting top-notch speakers, including keynotes from Baylor’s Richard Gibbs and Stuart Schreiber from the Broad Institute.


From the first speaker, Gabor Marth from Boston College, it became clear that one of the major hurdles now facing scientists is not DNA sequencing, as was true in years past, but processing the data. This has led to a situation where many groups write their own algorithms to perform the same functions, a widely recognized waste of resources. Scientists encouraged each other to stop reinventing the wheel, and also to ensure that bioinformatics tools can be used and reported on easily by biologists. That message resonated with us, as we have long championed the concept of a central data resource where excellent algorithms would be accessible to anybody. It’s gratifying to see that the same principle is gaining acceptance throughout academia as easy-to-use, reproducible data analysis becomes the real challenge in the sequencing process.



We also saw a string of fantastic talks on clinical sequencing. Sharon Plon from Baylor gave a very insightful “lessons learned” talk about their first year of clinical exome sequencing. The biggest pain point in the process was not sequencing, data analysis, insurance reimbursement, or finding patients in need; it was figuring out what to report to patients and how to do it. This underscores the need to bring genetic counselors, ethicists, and doctors into the conversation early to give guidance on what until recently has been a purely research-based endeavor. Dr. Plon and Joris Veltman from Radboud University presented several amazing case studies where sequencing had identified the cause of disease, allowed patients to take steps to improve their lives, and informed their families about the risk of recurrence. We look forward to hearing many more success stories.


Of course, cancer studies were a noteworthy trend at the conference. We heard about research on cancer genome evolution, epigenetic modification, sifting causative mutations from neutral ones, and the general effects of genome organization in three dimensions. But it was clear that integrating the information that’s being generated from all these techniques will be a big challenge. To get even deeper insights into human cancers, we’ll need to bring together the computational tools that we’ve already built and also bring together people from different scientific, medical, and social disciplines to apply that information intelligently. The good news is that this is already starting to happen, and we at DNAnexus are excited to be in a position to offer help as this approach gains traction.

New ENCODE Paper Reveals Remarkable Chromatin Diversity at Regulatory Elements


Today marks a major milestone for the ENCODE consortium! More than 30 papers will be published today in Nature, Genome Biology, and Genome Research from teams of scientists working on various facets of the project.


One of those, a publication in Genome Research, reports on a surprising level of heterogeneity among patterns of chromatin modifications as well as nucleosome positioning around regulatory elements such as transcription factor binding sites in the human genome. In the past, these genomic elements have been studied primarily by averaging patterns of chromatin marks across populations of sites, leading to the perception that patterns were much more uniform. The mapping and analysis of the nucleosome positioning sequence data were performed on the DNAnexus platform.


Lead author Anshul Kundaje was a postdoc for Serafim Batzoglou and Arend Sidow at Stanford University during the project reported in the paper. Now a research scientist at MIT, Kundaje says the work was an integral part of the ENCODE consortium’s efforts to elucidate functional elements in the human genome. The scientists looked at 119 human transcription factors and regulatory proteins to better understand how nucleosomes are positioned and how histone modifications are made around binding sites. In the paper, the authors report that asymmetry of nucleosome positioning and histone modifications is the rule, rather than the exception.


Kundaje and his colleagues relied on ChIP-seq data for the 119 transcription factors in a variety of cell types, with corresponding data for histone modifications. They also generated similar data for nucleosome positioning. To improve accuracy, the team sequenced extremely deeply, ultimately generating some 5 billion reads on the SOLiD sequencing platform. “The data sets were incredibly massive,” Kundaje says. “Processing these data sets locally was quite a challenge.” The group turned to DNAnexus, uploading their sequence files to the cloud and preprocessing the data with the company’s probabilistic mapping tool. “DNAnexus made that process incredibly simple,” he adds.


Figure 1: The mapping of the 5 billion reads was performed using the DNAnexus mapper.

Using a new tool they developed for pattern discovery, the Clustered AGgregation Tool (CAGT), the scientists found that nucleosome positioning and histone modification at transcription factor binding sites are far more diverse than was previously thought. Rather than averaging across the regions, as most studies have done, the new clustering tool analyzes the differences in magnitude, shape, and orientation of the many patterns identified.


“What we found is that the results you get from the clustering approach are dramatically different from what you get by simply averaging across all types,” Kundaje says. “We found a large diversity of patterns of histone modifications as well as nucleosome positioning around almost every transcription factor binding site.”
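
To make the contrast between averaging and clustering concrete, here is a minimal sketch in Python. It is not CAGT and does not reproduce the paper's method: it runs plain k-means on synthetic signal profiles, and every name and number in it (`synthetic_profiles`, `kmeans`, the peak positions, the noise level) is an assumption invented for illustration. Half of the simulated sites carry a peak to the left of the binding site and half to the right. The overall average shows two weak, symmetric bumps, while clustering recovers the two strong, asymmetric shapes.

```python
import random


def mean_profile(profiles):
    """Column-wise mean of equal-length signal profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, iterations=20):
    """Plain k-means, used here as a simple stand-in (NOT the CAGT algorithm)."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [profiles[0]]
    while len(centers) < k:
        centers.append(
            max(profiles, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    clusters = []
    for _ in range(iterations):
        # Assign every profile to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in profiles:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = [
            mean_profile(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


def synthetic_profiles(n_per_group=50, width=21, seed=0):
    """Hypothetical data: a triangular peak left or right of the site (index 10)."""
    rng = random.Random(seed)

    def profile(peak):
        return [
            max(0.0, 1.0 - abs(i - peak) / 4.0) + rng.gauss(0, 0.05)
            for i in range(width)
        ]

    left = [profile(5) for _ in range(n_per_group)]
    right = [profile(15) for _ in range(n_per_group)]
    return left + right


profiles = synthetic_profiles()
average = mean_profile(profiles)
centers, clusters = kmeans(profiles, k=2)

print("average at positions 5 and 15:", round(average[5], 2), round(average[15], 2))
for i, center in enumerate(centers):
    print("cluster", i, "size", len(clusters[i]),
          "signal at 5 and 15:", round(center[5], 2), round(center[15], 2))
```

In this toy setup the average suggests a weak signal on both sides of every site, which is true of none of them. Real data are much noisier, and the number of clusters is not known in advance, so treat this only as an illustration of why aggregate plots can mislead.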


Even the remarkably well-studied transcription factor CTCF, long established as an insulator, was found to have surrounding chromatin patterns pointing to other functions throughout the genome.


Figure 2: Analysis using CAGT reveals the surprising diversity of patterns of the active chromatin mark H3K27ac around the binding sites of the CTCF protein, which is well known for its repressive insulator role.


The authors used their clustering tool to group the patterns into some 25 distinct signatures “that completely capture the diversity of all the modifications across all binding sites in a variety of cell types,” Kundaje says. The method uses ‘metapatterns’ to explain that diversity, and that information can reveal the function of these elements in context. “By accounting for combinatorial relationships between various binding events and how they affect chromatin, this gives you a more complete biological sense of what a transcription factor is doing in a cell type,” he adds.


Kundaje is already following up on this study by looking at other species to see whether the heterogeneity of modification patterns holds true beyond human. He continues to use DNAnexus for analysis of sequencing data, especially in read mapping, quality control, and genome browsing, he says.


Using DNAnexus for the team’s ENCODE study “made the process significantly easier,” Kundaje adds, noting that the cloud provider’s direct integration of the genome browser was particularly helpful. DNAnexus allowed Kundaje and his colleagues to go from data to visualization with minimal processing steps in between, he says. “It frees up your time to focus on the more interesting work.”


For a glimpse of some of Kundaje’s data, DNAnexus has made the 20 samples available on its platform in the Public Data folder called Encode. Click here to sign up for a free account.


Check out the Kundaje et al. paper “Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements.”