At the Regeneron Genetics Center (RGC), scientists are uncovering genomic variants involved in human health and disease and enabling research into novel drugs and therapies. RGC receives 500,000 samples per year and generates about 500 billion reads per hour. To date, the center has sequenced over 1.5 million samples and created one of the largest catalogs of human genetic variation.
To gain insights from all that data, RGC needed infrastructure capable of capturing and handling large quantities of genome and phenotype information. In a recent webinar, William Salerno, RGC’s Senior Director of Genome and Sequencing Informatics, discussed RGC’s infrastructure and how the center built a system capable of meeting its current and future data analysis needs.
RGC carries out its data analysis using a combination of local infrastructure, cloud computing, and the DNAnexus Platform. RGC uses the platform to run various production and analytical workflows and for its data management and sharing needs. One component of the RGC platform is GLnexus, software for large-scale data merging that RGC researchers developed with DNAnexus and other partners. RGC researchers have tested it on over a million exome samples, so they are confident that it scales to meet their needs.
RGC needed a solution with metadata capture and pipeline version control to enable extensive logging and troubleshooting. The platform also provides a comprehensive security framework that keeps genomics data secure during analysis and processing. Salerno highlighted one example where RGC created an autonomous cloud environment for a partner that needed to analyze genomic and phenotypic data related to COVID-19. RGC was able to get the environment up and running in two days, and the partner was able to easily import data into the cloud and control who could access it.
For scientists looking to build infrastructure that can support large-scale genomics, Salerno highlighted some key factors to consider during the webinar. One is the cost of the platform, which covers not only the physical infrastructure but also audits, quality control, system redundancy, troubleshooting policies, managed services, and disaster recovery. Another is how metadata will be generated and captured on the platform.
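To make the metadata question more concrete, here is a minimal Python sketch of what capturing run metadata alongside a pipeline version could look like. It is an illustration only, not RGC's or DNAnexus's implementation: the `RunRecord` fields, the `capture_run` and `to_log_line` helpers, and the sample and pipeline names are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical metadata captured for one pipeline run."""
    sample_id: str
    pipeline_name: str
    pipeline_version: str
    input_sha256: str  # checksum of the input, to detect changed data
    started_at: str    # ISO 8601 timestamp in UTC


def capture_run(sample_id, pipeline_name, pipeline_version, input_bytes, now=None):
    """Tie a sample to the exact pipeline version and input it was run with."""
    now = now or datetime.now(timezone.utc)
    return RunRecord(
        sample_id=sample_id,
        pipeline_name=pipeline_name,
        pipeline_version=pipeline_version,
        input_sha256=hashlib.sha256(input_bytes).hexdigest(),
        started_at=now.isoformat(),
    )


def to_log_line(record):
    """Serialize the record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # Illustrative values only; real identifiers would come from the platform.
    record = capture_run("SAMPLE-0001", "exome-calling", "1.4.2", b"ACGT")
    print(to_log_line(record))
```

Recording the pipeline version and an input checksum with every run is one simple way to support the kind of logging and troubleshooting described above, since any result can be traced back to the code and data that produced it.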
RGC is committed to ensuring that its work is equitable, open access, and transparent. To that end, the center makes open-source versions of the genome analysis pipelines it uses on the Titan Platform available to the scientific community.