Efficient Approximations for Learning Phylogenetic HMM Models from Data

MSR-TR-2003-62

We consider models for learning an evolutionary, or phylogenetic, tree from data consisting of DNA sequences that correspond to the leaves of the tree. In particular, we consider a general probabilistic model described in [16], which we call the phylogenetic-HMM model and which generalizes the classical probabilistic models of Neyman and Felsenstein. Unfortunately, computing the exact likelihood of phylogenetic-HMM models is intractable. We consider several approximations for computing the likelihood of such models, including an approximation introduced in [16], loopy belief propagation, and several variational methods. We demonstrate that, unlike the other approximations, variational methods are accurate and are guaranteed to lower bound the likelihood. In addition, we find that the variational approximation that performs best is the one whose q distribution corresponds to the classic Neyman–Felsenstein model. Applying our best approximation to data from the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene region across nine eutherian mammals reveals a CpG effect.
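
The lower-bound guarantee follows from the standard variational argument, sketched here in generic form. The notation is an assumption made for illustration and is not taken from the report: X denotes the observed leaf sequences, H the unobserved variables of the model, \theta the model parameters, and q any distribution over H.

```latex
\begin{align*}
\log p(X \mid \theta)
  &= \log \sum_{H} q(H)\, \frac{p(X, H \mid \theta)}{q(H)} \\
  &\ge \sum_{H} q(H) \log \frac{p(X, H \mid \theta)}{q(H)}
     && \text{(Jensen's inequality)} \\
  &= \log p(X \mid \theta)
     - \mathrm{KL}\!\left( q(H) \,\middle\|\, p(H \mid X, \theta) \right).
\end{align*}
```

Because the KL divergence is nonnegative, the bound holds for every choice of q, and it is tight exactly when q equals the posterior over H. Restricting q to a tractable family keeps the bound computable, which is presumably the role played by a q distribution of Neyman–Felsenstein form.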