The FDA, the Global Alliance for Genomics and Health (GA4GH), and the National Institute of Standards and Technology (NIST) recently teamed up to create a once-in-a-blue-moon challenge for genomic scientists! In the precisionFDA Truth Challenge, genomic innovators were invited to test their informatics pipelines on two datasets: the well-characterized Genome in a Bottle (GiaB) reference sample NA12878 (HG001), and a new reference sample, HG002, whose truth data had not yet been released.
PrecisionFDA is an online, cloud-based, virtual research space where members of the genomics community can experiment, share data and tools, collaborate, and define standards for evaluating analytical pipelines. Community members span academia, industry, healthcare organizations and government, all working together to further innovation and develop regulatory science around NGS tests. The community currently includes more than 1,500 users across 600 organizations, with more than 10 terabytes of genetic data stored.
This is the second challenge issued through precisionFDA, following the precisionFDA Consistency Challenge. The Truth Challenge is about discovering the consistency and accuracy of informatics pipelines when analyzing a human sample whose truth data is unknown. NIST and GiaB released the truth data May 26, 2016, after the close of the challenge.
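To make "accuracy against truth data" concrete, here is a minimal sketch of how a pipeline's variant calls could be scored against a truth callset. It is an illustration only, not the challenge's actual comparison method: the `Variant` type, the `score_calls` function, and the example variants are all hypothetical, and the sketch assumes variants can be matched by exact chromosome, position, and alleles. Real comparisons typically need to account for different representations of the same variant.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record; real pipelines emit VCF entries."""
    chrom: str
    pos: int
    ref: str
    alt: str


def score_calls(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Score a callset against a truth set using exact matching (an assumption)."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the pipeline
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Made-up example data: four true variants, of which the pipeline finds three
# and additionally reports one variant that is not in the truth set.
truth_set = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr1", 2500, "C", "T"),
    Variant("chr2", 300, "G", "GA"),
    Variant("chrX", 7000, "T", "C"),
}
call_set = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr1", 2500, "C", "T"),
    Variant("chrX", 7000, "T", "C"),
    Variant("chr3", 42, "A", "T"),
}

scores = score_calls(call_set, truth_set)
```

Because the HG002 truth data was withheld until the challenge closed, participants could not compute scores like these for themselves while tuning their entries.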
What makes this challenge so exciting?
In 2014, NIST released NA12878, the first gold standard whole human reference genome, in collaboration with GiaB and the FDA. Since then, it has arguably become one of the most studied biospecimens. Researchers from around the world use NA12878 as training data for assessing pipeline performance.
Many pipelines use some sort of machine learning algorithm to determine whether a reported mutation is real or not, so the difficulty is ensuring that a pipeline doesn’t overfit the training data. Pipelines can be tuned to maximize performance on the training dataset, and if the test data happens to be similar to the training data, the pipeline’s performance will look misleadingly consistent and accurate. A great resource for understanding why scientists split data into train and test roles in order to assess the accuracy, reliability, and credibility of their predictive models (the algorithm that goes into a pipeline) can be found here.
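The overfitting risk can be illustrated with a toy sketch. It assumes a hypothetical pipeline whose only tunable parameter is a quality-score cutoff for reporting a variant; the data, the names `f_score_at` and `tune_threshold`, and the numbers are invented for illustration and do not come from any real pipeline or from the challenge.

```python
from typing import List, Tuple

# Each candidate call is (quality score, whether the variant is real).
Sample = List[Tuple[int, bool]]


def f_score_at(threshold: int, sample: Sample) -> float:
    """F-score when only calls with quality >= threshold are reported."""
    tp = sum(1 for qual, real in sample if qual >= threshold and real)
    fp = sum(1 for qual, real in sample if qual >= threshold and not real)
    fn = sum(1 for qual, real in sample if qual < threshold and real)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(sample: Sample) -> int:
    """Pick the cutoff that maximizes the F-score on the given sample."""
    candidates = sorted({qual for qual, _ in sample})
    return max(candidates, key=lambda t: f_score_at(t, sample))


# Made-up "training" sample: real variants happen to have quality >= 22.
train: Sample = [(12, False), (18, False), (22, True), (35, True), (50, True)]
# Made-up "test" sample with a different quality profile.
test: Sample = [(15, True), (20, True), (25, False), (40, True), (55, True)]

best = tune_threshold(train)
train_score = f_score_at(best, train)  # looks perfect on the data it was tuned on
test_score = f_score_at(best, test)    # noticeably lower on unseen data
```

In this toy example the cutoff tuned on the training sample separates it perfectly, yet the same cutoff misses real variants and admits a false one in the test sample. Only the held-out score says anything about how the pipeline would behave on a new genome.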
To test the performance of pipelines in real life, scientists needed a second reference sample and an associated truth callset on which NGS pipelines have not been trained. This is exactly what NIST and GiaB have provided in reference sample HG002.
Scientists can now evaluate algorithms using test data that is separate from the training data, an attribute that is broadly accepted as fundamental to the evaluation methodology. Moreover, unlike NA12878, the new reference sample HG002 is male, which poses new challenges to algorithms since there is only one copy of the X chromosome, and brings new opportunity for evaluating NGS methods along this dimension.
The winners
As the clock struck midnight EST on May 25, 2016, the precisionFDA Truth Challenge closed with 36 entries across 21 teams, spanning 5 countries; truly an international competition of epic proportions!
The winners of the Truth Challenge will be announced at the upcoming Festival of Genomics in Boston on June 29th at 8:45 am EST by Elizabeth Mansfield, PhD, Deputy Director for Personalized Medicine in FDA’s Center for Devices and Radiological Health’s Office of In Vitro Diagnostics and Radiological Health. Registration is free. Want to learn more about precisionFDA? Stop by the DNAnexus booth (#240) during the Festival to receive a demo of the precisionFDA platform from a member of the precisionFDA team.
Want to recreate the Truth Challenge for yourself? Join the precisionFDA community today and evaluate a pipeline of your choice against HG002. Happy testing!