Towards a universal genomic positioning system: Phylogenetics and species identification


Technology for gathering biomic data now far exceeds the capabilities of the tools available to extract useful information and knowledge from it, a predicament that hampers pressing demands of our time such as personalized medicine. We propose a new family of data structures to represent and process omics data in a way that is more anchored in biological reality, with algorithms that are more consistent with it, so that DNA itself can be used to process the data, extract useful knowledge, and organize and store it as needed. These structures enable much more efficient processing of genomic and proteomic data and can serve as the foundation of a truly universal Genomic Positioning System (GenIS). The power of this approach is illustrated by a new universal set of biomarkers and methods for two important problems in biology: phylogenetic analysis, and species identification and classification. We show that certain metrics on these representations can be used to obtain ab initio, from genomic data alone (possibly including full genomes) and in a matter of minutes or hours, well-established and accepted phylogenies crafted in biology over the last 50 years (such as the 16S rRNA-based phylogenies). We also show how the same representation can be used to solve recognition problems associated with genomic data, including in particular species identification, and to store large genomes in compact representations while preserving the ability to query them efficiently. Finally, we sketch other applications to be explored in the future, including objective criteria for producing biological taxonomies and, ultimately, a truly universal and comprehensive “Atlas of Life”, as it is or as it could be on earth.

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)