Genome phylogenetic analysis based on extended gene contents

Abstract

With the rapid growth of entire genome data, whole-genome approaches such as gene content become popular for genome phylogeny inference, including the tree of life. However, the underlying model for genome evolution is unclear, and the proposed (ad hoc) genome distance measure may violate the additivity. In this article, we formulate a stochastic framework for genome evolution, which provides a basis for defining an additive genome distance. However, we show that it is difficult to utilize the typical gene content data-i.e., the presence or absence of gene families across genomes-to estimate the genome distance. We solve this problem by introducing the concept of extended gene content; that is, the status of a gene family in a given genome could be absence, presence as single copy, or presence as duplicates, any of which can be used to estimate the genome distance and phylogenetic inference. Computer simulation shows that the new tree-making method is efficient, consistent, and fairly robust. The example of 35 microbial complete genomes demonstrates that it is useful not only to study the universal tree of life but also to explore the evolutionary pattern of genomes.

Publication Title

Molecular Biology and Evolution

Share

COinS