Detection of significant protein co-evolution
Mirrortree and related methodologies are widely used for the co-evolution-based prediction of protein interactions. These methods quantify the similarity of two phylogenetic trees (as a proxy for co-evolution) as the Pearson correlation between the corresponding distance matrices. In spite of its success, this approach has a number of mathematical and biological problems, which in the past were circumvented with different post-filters or artificial thresholds, or simply ignored. For example, any pair of trees has a background similarity (due to the underlying speciation process) regardless of whether the corresponding proteins interact, which precludes the use of analytical p-values derived from a null model that does not take this into account. Most of these problems are related to the lack of an adequate statistical framework to assess the significance of an observed co-evolutionary score (tree similarity) with respect to those expected for unrelated proteins. We have now developed such a framework and shown that most of these problems and drawbacks are solved or alleviated in a single shot. The idea is to associate confidence estimators (p-values) with the tree similarity scores using a null model specifically constructed for the tree comparison problem. This approach generates a large set of shuffled phylogenetic trees by interchanging branches taken from the real trees. The trees within this set are used to derive a background distribution of tree similarity scores, which is later used to extract empirical p-values for the similarities observed for real trees. This new approach, named pMT, largely improves the quality and coverage (number of pairs that can be evaluated) of the detected co-evolution in all the stages of the mirrortree workflow, outperforming previous versions of this methodology.
It makes it possible to generate a reliable and comprehensive network of predicted interactions, and provides information on the substructure of macromolecular complexes, all using genomic information only. As a side result, since the benchmarks were performed using the genomes available at different time points in the past (so as to evaluate the behavior of the methods when fed with different genomic information), this work also evaluates for the first time how the uneven coverage of the bacterial taxonomy by sequenced genomes affects the detection of co-evolution, and which trends are expected for the future.
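The two numerical steps described above, the tree similarity score and the empirical p-value, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the pMT implementation: the function names (`mirrortree_score`, `empirical_p_value`) are hypothetical, both distance matrices are assumed to list the same species in the same order, and the `+1` correction in the p-value is a common convention that may differ from the one actually used. The generation of shuffled trees by interchanging branches is not reproduced here; the null scores are simply taken as input.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant distances")
    return sxy / math.sqrt(sxx * syy)


def mirrortree_score(dist_a, dist_b):
    """Tree similarity as the correlation between two distance matrices.

    Assumption: rows and columns of both matrices follow the same
    species order, so entry (i, j) refers to the same species pair.
    """
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


def empirical_p_value(observed, null_scores):
    """Fraction of background scores at least as high as the observed one.

    Assumption: the +1 in numerator and denominator is a common
    convention that keeps the estimate above zero.
    """
    at_least = sum(1 for score in null_scores if score >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


# Toy inputs for illustration only (not real data).
toy_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
toy_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
toy_null = [0.5, 0.7, 0.95, 0.6]  # would come from the shuffled trees

score = mirrortree_score(toy_a, toy_b)
p_value = empirical_p_value(0.9, toy_null)
```

In this sketch the background distribution is whatever list of scores the caller provides. In the approach described above, it would be obtained by scoring pairs of shuffled trees rather than unrelated random numbers, which is what lets the p-value account for the background similarity shared by all trees.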
© 2015, Computational Systems Biology Group. CNB-CSIC