New 3000 Rice Genomes AWS Public Dataset – Easy Access on DNAnexus Platform

In June we announced that DNAnexus was powering the 3000 Rice Genomes Project (3K RGP). You can read the press release here and blog here. The project, a partnership between the Chinese Academy of Agricultural Sciences (CAAS), the International Rice Research Institute (IRRI), and BGI in China, along with their numerous collaborators globally, is attempting to help feed the world’s growing population. Rice is a dietary staple for half of the world’s human population. It is estimated that the production of rice must increase by at least 25% by 2030 in order to keep up with global population growth and demand.

3K RGP partnered with DNAnexus to develop the bioinformatics pipeline to analyze the sequence data of 3,000 different rice varieties against five published draft genomes. Performing the analysis on the DNAnexus Platform allowed them to leverage the scalable computing capability at AWS to process more than 100 TB of source genomic data across 37,000 concurrent compute cores in just two days — more than 200 times faster than would have been possible on local computing infrastructure. Located across the globe in 10 countries, the 3K RGP investigators were able to access results and collaborate in real time. The result has been the identification of hundreds of new genetic markers, each a potential pathway to improving outcomes for rice production.

Today we are happy to share with you that AWS has made available the genomic analysis data of 3,000 rice varieties as an AWS Public Dataset. The data contains over 30 million genetic variations spanning all known and predicted rice genes. By making this dataset public, AWS hopes to accelerate research efforts and breeding programs. Knowing the genetic makeup of a rice variety will allow researchers to identify critical genetic markers related to specific phenotypic traits. With this information, breeders will be able to make more intelligent choices in variety selection for cross-breeding, resulting in more rapid development of rice varieties with higher nutritional content, improved climate stress tolerance, or better disease resistance.

DNAnexus has made it easy to access Amazon public resources on the platform. Documentation for accessing the AWS 3K RGP public dataset can be found in the DNAnexus wiki. In addition, the 3K RGP analytical tools and pipelines used to produce the results are available on the DNAnexus Platform listed as a featured project: ‘3000 Rice Genomes’.

Big Data Rice Research Helps to Feed the World

Supported by grants from the Bill and Melinda Gates Foundation and the Chinese Ministry of Science and Technology, the 3000 Rice Genomes Project (3K RGP) is a multi-agency project directed at acquiring genomic information needed to accelerate rice breeding programs. Led by the Chinese Academy of Agricultural Sciences (CAAS), the International Rice Research Institute (IRRI) and BGI, researchers hope to enable the rapid development of higher-yielding, drought-tolerant, pest- or disease-resistant strains of rice with high nutritional value. The urgency and importance of this project would be difficult to overstate, considering that the production of rice must increase by at least 25% by 2030 in order to keep up with global population growth and demand. Without these new strains, food shortages could affect the half of the world’s human population that depends on rice as a dietary staple.

A Fast Start And A Temporary Delay
Last year, the 3K RGP completed the sequencing of 3,000 rice genomes and released the project’s initial 13 TB dataset. The team then encountered a bottleneck consisting of two related issues. First, the project’s local computing infrastructure was not capable of managing the massive analysis load required to process the dataset within an acceptable timeframe. Second, the 3K RGP calls for widespread distribution of rice genome sequencing data and analysis results in order to stimulate collaboration among the global rice genome research community. Conventional methods of distribution, such as shipping hard drives between collaborators, are not feasible at this scale.

The Solution That Saved More Than A Year
Working with DNAnexus, the 3K RGP was able to deploy a rapid solution to analyze the 3,000 rice genomes dataset and generate more than 100 TB of useful data, without any of the costs or delays typically involved in purchasing and bringing new infrastructure online. Taking advantage of the computing capability of 37,000 compute cores working together, the DNAnexus genome informatics platform completed sequence mapping and variant calling in just two days — more than 200 times faster than would have been possible on local computing infrastructure. This cloud-based solution also solved the issue of data distribution by providing immediate access to data and analysis results, and enabling real-time collaboration among 3K RGP investigators worldwide, who have already discovered hundreds of new genetic markers.
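
For readers who want to check the headline, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (37,000 cores, two days, 200 times faster). It treats "more than 200 times" as a lower bound, and it assumes every core was busy for the full two days, which is almost certainly an overestimate of actual usage. The function names are ours, purely for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the post.
# Assumptions (ours, not the project's): the 200x speedup is a lower bound,
# and all 37,000 cores ran for the whole two days (an upper bound on usage).

CORES = 37_000      # concurrent compute cores quoted in the post
CLOUD_DAYS = 2      # wall-clock time for mapping and variant calling
SPEEDUP = 200       # "more than 200 times faster" than local infrastructure


def local_days(cloud_days: int, speedup: int) -> int:
    """Implied minimum wall-clock days on the local infrastructure."""
    return cloud_days * speedup


def core_hours(cores: int, days: int) -> int:
    """Upper bound on core-hours if every core ran for the full period."""
    return cores * days * 24


if __name__ == "__main__":
    est = local_days(CLOUD_DAYS, SPEEDUP)
    print(f"Implied local runtime: at least {est} days ({est / 365:.2f} years)")
    print(f"Cloud compute: at most {core_hours(CORES, CLOUD_DAYS):,} core-hours")
```

Two days multiplied by 200 gives at least 400 days, which is where the "more than a year" in the heading comes from.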

Practical Solutions For The Whole World
Each of these new genetic markers has the potential to be linked to valuable traits that can improve the nutrition, climate tolerance, and disease resistance of new rice varieties. Significantly, this new genetic information generated by the 3K RGP can be used to accelerate the development of new strains using highly efficient cross-breeding, a centuries-old technique now informed by “big data,” rather than direct genetic modification. The result would be robust, high-yielding strains of rice, without the concerns that commercial, political, or public-opinion stakeholders raise about genetically modified organisms (GMOs).

We are thrilled to be powering the 3000 Rice Genomes Project, a collaboration that is tackling one of the most exciting opportunities to improve human wellbeing with big data and genomics. DNAnexus is proud to be bringing it all together.