Understand effective coverage by mapped reads using genome repeat complexity
Sequencing depth, the expected coverage of nucleotides by reads, is computed under the assumption that reads are synthesized uniformly across chromosomes. In reality, read coverage across genomes is not uniform. Although a coverage of 10x, for example, means that a nucleotide is covered 10 times on average, nucleotides in certain parts of a genome are covered much more or much less often. One factor that influences coverage is the ability of a read aligner to align reads to genomes. If part of a genome is complex, e.g. because it contains many repeats, aligners may have trouble aligning reads to that region, resulting in low coverage. We introduce a systematic approach to predict the effective coverage of genomes by short-read aligners. The effective coverage of a chromosome is defined as the actual number of bases covered by reads. We show that this quantity is highly correlated with the repeat complexity of genomes. Specifically, we show that the more repeats a genome has, the less it is covered by short reads. We demonstrate this strong correlation with five popular short-read aligners in three species: Homo sapiens, Zea mays, and Glycine max. Additionally, we show that, compared to other measures of sequence complexity, repeat complexity is the most appropriate. This work makes it possible to predict the effective coverage of genomes at a given sequencing depth.
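The distinction between nominal sequencing depth and effective coverage can be illustrated with a minimal Python sketch. It is not the authors' method. It assumes that depth is computed as (number of reads x read length) / genome length, and that a base counts as covered when at least one mapped read overlaps it. The function names and the (start, length) alignment format are hypothetical, chosen only for illustration.

```python
def sequencing_depth(num_reads, read_length, genome_length):
    """Nominal depth: expected number of times each base is covered,
    assuming reads are spread uniformly over the genome."""
    return num_reads * read_length / genome_length


def covered_bases(alignments, genome_length):
    """Count bases covered by at least one mapped read.

    `alignments` is an iterable of (start, length) pairs with 0-based
    start positions (a hypothetical format for this sketch). Reads the
    aligner failed to map simply do not appear in it.
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (genome_length + 1)
    for start, length in alignments:
        if start < 0 or start >= genome_length or length <= 0:
            continue  # ignore alignments outside the genome
        end = min(start + length, genome_length)
        diff[start] += 1
        diff[end] -= 1

    covered = 0
    depth = 0
    for position in range(genome_length):
        depth += diff[position]  # per-base depth at this position
        if depth > 0:
            covered += 1
    return covered


# Toy example: three 5-base reads mapped to a 50-base genome.
toy_alignments = [(0, 5), (3, 5), (20, 5)]
toy_genome_length = 50
nominal = sequencing_depth(len(toy_alignments), 5, toy_genome_length)
covered = covered_bases(toy_alignments, toy_genome_length)
fraction = covered / toy_genome_length
```

In this toy case the nominal depth is 0.3, but only 13 of the 50 bases (26%) are covered, because two of the reads overlap. Dividing the covered-base count by genome length is a normalization chosen here for readability, not something the abstract specifies.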
Proceedings of 11th International Conference on Bioinformatics and Computational Biology, BiCOB 2019
Gao, S., Tran, Q., & Phan, V. (2019). Understand effective coverage by mapped reads using genome repeat complexity. Proceedings of 11th International Conference on Bioinformatics and Computational Biology, BiCOB 2019, 65-73. Retrieved from https://digitalcommons.memphis.edu/facpubs/3302