Analysis of optimal alignments unfolds aligners' bias in existing variant profiles


Efforts such as International HapMap Project and 1000 Genomes Project resulted in a catalog of millions of single nucleotides and insertion/deletion (INDEL) variants of the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners' called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence in downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.

Publication Title

BMC Bioinformatics