Repeat complexity of genomes as a means to predict the performance of short-read aligners
We investigated the extent to which the complexity of genomic sequences affects the performance of short-read aligners. We demonstrated that a proper measure of sequence complexity was essential in studying the relationship between alignment performance and the abundance of repeats in genomes. In particular, we demonstrated that popular measures of sequence complexity were not suitable and that the right measure of repeat complexity correlated strongly with the performance of many popular short-read aligners. Using genomic sequences from a diverse set of species, we observed that as repeat complexity increased, the performance of these aligners decreased proportionally. This strong negative correlation was observed in all three important aspects of alignment performance: (i) precision, (ii) accuracy and (iii) chromosomal coverage by mapped reads. We took advantage of this strong correlation to construct linear regression models that could accurately predict alignment performance from repeat complexity without having to align millions of reads to genomes. This finding suggests a novel approach to selecting aligners for new genomes and has great potential for reducing experimental cost.
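The regression step described above can be illustrated with a minimal sketch. It assumes that each genome has already been assigned a single repeat-complexity score and one measured performance value (here, precision) for a given aligner. The abstract does not define the complexity measure, so it is not computed here. The names `fit_line` and `predict` are hypothetical, and the numbers are made-up placeholders chosen only to show a negative linear trend; they are not results from the paper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Predicted performance for a genome with complexity score x."""
    slope, intercept = model
    return slope * x + intercept


# Placeholder values for illustration only (NOT data from the paper):
# one (complexity score, measured precision) pair per training genome.
complexity = [0.10, 0.20, 0.30, 0.40, 0.50]
precision = [0.98, 0.96, 0.94, 0.92, 0.90]

model = fit_line(complexity, precision)

# Estimate precision for a new genome from its complexity score alone,
# without aligning any reads to it.
estimate = predict(model, 0.60)
print(f"slope={model[0]:.3f} intercept={model[1]:.3f} estimate={estimate:.3f}")
```

Under these assumptions, one such model would be fitted per aligner and per performance aspect (precision, accuracy, chromosomal coverage), and the aligner with the best predicted value for a new genome could be selected.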
Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016
Tran, Q., Gao, S., Vo, N., & Phan, V. (2016). Repeat complexity of genomes as a means to predict the performance of short-read aligners. Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016, 135-141. Retrieved from https://digitalcommons.memphis.edu/facpubs/3144