How genome complexity can explain the difficulty of aligning reads to genomes
Background: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. Results: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. Conclusions: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.
Phan, V., Gao, S., Tran, Q., & Vo, N. (2015). How genome complexity can explain the difficulty of aligning reads to genomes. BMC Bioinformatics, 16 (17) https://doi.org/10.1186/1471-2105-16-S17-S3