Biologists push theory to experiment with the wisdom of crowds.
Seven years ago, IBM Research scientist Dr. Gustavo Stolovitzky’s team was looking for a way to better understand the accuracy of the biological results yielded by the network reconstruction algorithms they were developing at IBM. In other words, how could the team improve the evaluation of its reverse engineering efforts to better understand, and perhaps help solve, biomedical challenges such as cancer?
More generally, all computational biologists want a clear-cut evaluation of the models they use to analyze and eventually represent biological systems. Are their techniques working? How do their techniques compare with other techniques?
Stolovitzky and collaborator Dr. Andrea Califano, now the director of Columbia University’s Initiative in Systems Biology, decided to organize the DREAM (Dialogue on Reverse Engineering Assessment and Methods) Project to crowd source the analysis of high-throughput data (now so pervasive in biological research) and address important challenges in biology.
With DREAM7 now taking submissions, Stolovitzky and colleague Dr. Pablo Meyer Rojas* discuss the goals of the project and how to submit responses to this year’s challenges.
How did the DREAM Project start?
Gustavo Stolovitzky: The explosion of genomics has created the need to organize and structure the data produced to generate a coherent biological picture. DREAM was created in order to foster concerted efforts by computational and experimental biologists to understand the limitations of the models built from these high-throughput data sets.
While I conceived the DREAM project as a way to understand the accuracy of the biological results yielded by the network reconstruction algorithms (reverse engineering) we were developing at IBM, it captured a need in the community that was, so to speak, up in the air.
My long-time collaborator (and former IBMer) Andrea Califano and I organized the first meeting with the New York Academy of Sciences in 2006. After that, the project was launched as a series of annual challenges that culminate in the DREAM conference.
What is the project's overall goal?
GS: In the context of the current avalanche of genomic data, DREAM's goal is to objectively assess and enhance the quality of data-based modeling of biological systems. For example, if we know what the results of a particular analysis should be (because we have what we call the “ground truth” contained in unpublished information, not yet available to the community at large) then we can test the community to assess how close to the ground truth the results are.
This approach has many useful outcomes.
- It can find the best analytical method for a given problem, because all the methods are pitted against each other on the same data set, and under the same evaluation scheme.
- It enables a dialogue in the community about why an analytical tool may yield good or bad results.
- It fosters a synergy between theoretical, computational and experimental scientists – all of whom look at the same data from different perspectives to achieve the great goal of understanding biology.
- It can help garner evidence for or against a hypothesis because, if nobody in the community can solve a given problem predicated on a hypothesis, then the underlying hypothesis may be wrong. Conversely, if at least one member of the community solves it, then the hypothesis can be considered verified.
- The outcomes of DREAM have the potential to complement peer-reviewed research, and increase the confidence of the scientific community in biological models and algorithm reliability.
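The first outcome in the list, scoring every method on the same data set under the same evaluation scheme, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the team names and numbers are invented, and the area under the ROC curve (AUROC) stands in for whatever metrics a given challenge actually uses.

```python
def auroc(scores, truth):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive case received the higher score.
    Ties count as half a win."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    if not pos or not neg:
        raise ValueError("ground truth needs both positive and negative cases")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_submissions(submissions, truth):
    """Score every submission against the same ground truth and
    return (team, score) pairs, best first."""
    scored = {team: auroc(pred, truth) for team, pred in submissions.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ground truth (1 = interaction exists) held back by the organizers.
ground_truth = [1, 0, 1, 0, 0, 1]

# Hypothetical confidence scores submitted by three teams for the same six cases.
submissions = {
    "team_a": [0.9, 0.2, 0.8, 0.3, 0.1, 0.7],
    "team_b": [0.6, 0.7, 0.5, 0.2, 0.4, 0.9],
    "team_c": [0.1, 0.9, 0.2, 0.8, 0.7, 0.3],
}

leaderboard = rank_submissions(submissions, ground_truth)
for team, score in leaderboard:
    print(f"{team}: AUROC = {score:.3f}")
```

Because every team is scored by the same function on the same hidden answers, differences in the leaderboard reflect the methods rather than the data or the metric.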
DREAM states that its “main objective is to catalyze the interaction between experiment and theory in the area of cellular network inference and quantitative model building in systems biology.” Please elaborate on this.
Pablo Meyer Rojas: The goal of systems biology is to understand the biological whole as more than the sum of the individual parts. In order to do this, we need to build comprehensive context-specific models of biological processes at the cellular or organism level, based on data inherent to the system under study.
We say that the models need to be quantitative because the ultimate goal of systems biology is to describe the behavior of biological systems based on precise measurements, and predict the response of those systems to perturbations, such as disturbances caused by disease.
These models are based on the construction of cell-maps from data describing the interactions of DNA, mRNA, proteins, drugs, etc. Networks are a succinct way to represent these interactions, and are the scaffolding from which to build the mathematical models that quantitatively implement our understanding of the biological realm.
How are the challenges chosen?
GS: It has been said that a wise man's question contains half the answer.
With DREAM we try to pose relevant and important questions (the project’s challenges) about biological problems, whose answers should be found through the analysis of complex biological data. For example, how can we predict the survival of a cancer patient based on genomics data extracted from the patient's tumor? Or, what is the therapeutic effect of a drug on a cell, given that we know the effect of the same drug on other cells?
Another important consideration is that we need to know the answer of a challenge to assess the predictions. Therefore the availability of unpublished data that can be used as ground truth to evaluate the submissions – and the willingness of the data producers to share their unpublished data – is essential.
Why use crowd sourcing?
PMR: In order to tap the wisdom of crowds, we need the crowds! Crowd sourcing is an effective way to reach out to people from a diverse set of communities as participants, to get a broad spectrum of methods for solving a problem.
Suppose you have a tough question for which you need an answer. You may not know the answer, and your immediate friends may not know the answer, either. But what if you could ask that same question of your whole neighborhood, town, province, country or planet?
It is a bit like the “ask the audience” lifeline in the game show “Who Wants to Be a Millionaire”.
It is possible that someone who has the expertise happens to know the answer. But to find that person we need to tap the crowds. In the case of systems biology, crowd sourcing the solution of a challenge allows us to search among many different methodologies used to analyze the bio-data, and find the one that produces the most accurate predictions. The more participants we get, the more likely it is that if a solution exists, we will find it.
How is a "best answer" for each challenge chosen, and who chooses?
GS: Before the challenges are made public, individuals involved in the organization of a challenge (including people who generated the data) get together and decide on a scoring method based on a few different metrics. Participants are then informed of how their entries will be evaluated.
Once the challenge is finished, predictions are evaluated and scores are published, along with all of the scoring methods. Only the names of the best performers are revealed, but each participant is informed of his or her own score.
Something interesting we discovered is that when we aggregate the predictions of the community, the resulting aggregate solution tends to be the best answer. This gives new meaning to the concept of the wisdom of the crowds.
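To make the idea of an aggregate community solution concrete, here is a minimal sketch that combines several submissions by averaging the rank each team gives to each item. Mean-rank aggregation is an assumption chosen for simplicity, not necessarily the procedure used in any DREAM challenge, and the team names and scores are invented.

```python
def ranks(scores):
    """Convert confidence scores to ranks (1 = highest score).
    Tied scores share the average of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def aggregate_by_mean_rank(submissions):
    """Average each item's rank across all submissions.
    A lower mean rank means stronger community support."""
    per_team = [ranks(pred) for pred in submissions.values()]
    n_items = len(per_team[0])
    return [sum(r[i] for r in per_team) / len(per_team) for i in range(n_items)]


# Hypothetical confidence scores from three teams for the same four items.
submissions = {
    "team_a": [0.9, 0.1, 0.4, 0.3],
    "team_b": [0.2, 0.3, 0.8, 0.1],
    "team_c": [0.7, 0.6, 0.5, 0.2],
}

mean_rank = aggregate_by_mean_rank(submissions)
# Item indices ordered from most to least supported by the community.
consensus_order = sorted(range(len(mean_rank)), key=lambda i: mean_rank[i])
print(consensus_order)
```

Working with ranks rather than raw scores keeps teams that use different confidence scales comparable. The aggregate can then be scored against the ground truth in exactly the same way as any individual submission.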
How will these “best responses” be used? Do you have a past example to share?
PMR: The algorithms of the best performers can be used to generate new predictions that will be tested experimentally. For example, in DREAM5 a challenge asked for predictions to determine the affinity of synthetically generated peptides (peptides are small pieces of proteins) to antibodies (proteins that rid the body of pathogens). The algorithms from the best performers were then used to generate and test a second round of peptides that were predicted to work better together.
In another DREAM5 challenge, a community prediction of the gene regulatory network of Staphylococcus aureus was created. It could be used to help find new antibiotics against this serious bacterial pathogen, which can cause infections such as MRSA (methicillin-resistant Staphylococcus aureus).
Who can participate, and how?
GS: Anyone is invited to participate. The more diverse the community of participants, the better our chance of finding an innovative methodology. Participants need to register here, and can choose any (or all) of the four challenges.
This year’s challenges are what we call translational, in the sense that we use basic research that can be translated into medically relevant knowledge, including areas such as breast cancer and amyotrophic lateral sclerosis, commonly referred to as ALS or Lou Gehrig’s disease.
We also have a number of incentives for challenge participation. For example, in the prediction of progression of Lou Gehrig’s disease, the non-profit Prize4Life will award $25,000 to the best performing submission.
For all challenges, an expense-paid speaking invitation to the DREAM conference (Nov 12-16 in San Francisco) will be provided to the best performer. This year we are also partnering with the journals Open Network Biology, Science Translational Medicine and Nature Biotechnology for publication of the best performing results.
* Besides Stolovitzky and Meyer, IBM Research scientists Raquel Norel and Erhan Bilal are working on the DREAM Project.