Doctor of Philosophy
DNA sequencing technologies have advanced into the realm of big data, driven by frequent and rapid developments in biomedicine. This has caused a surge in demand for efficient and highly scalable algorithms. This dissertation focuses on work in read-to-reference alignment, resequencing studies, and metagenomics that was designed with these principles as its guiding motivation.

First, consider the computationally intensive task of read-to-reference alignment, where the difficulty of aligning reads to a genome is directly related to the genome's complexity. We investigated three different formulations of sequence complexity as tools for measuring genome complexity, examined how they relate to short-read alignment, and found that repeat-based measures of complexity were best suited for this task. In particular, the fraction of distinct substrings of lengths close to the read length was found to correlate very highly with alignment accuracy in terms of precision and recall. This demonstrated how to build models that predict the accuracy of short-read aligners with predictably low error. As a result, practitioners can select the most accurate aligner for an unknown genome by comparing how different models predict alignment accuracy based on the genome's complexity. Furthermore, accurate prediction of recall rates may help practitioners reduce expenses by using just enough reads to reach sufficient sequencing coverage.

Next, focus on the comprehensive task of resequencing studies for analyzing genetic variants in the human population. Using optimal alignments, we revealed that current variant profiles contain thousands of insertions/deletions (INDELs) that were constructed in a biased manner. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations.
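The source of this ambiguity can be illustrated with a toy example (the sequences below are hypothetical, not data from the dissertation): a read carrying a one-base deletion inside a repeat run matches the reference equally well no matter which copy of the repeated base the aligner chooses to delete, so several distinct alignments are all equally optimal.

```python
ref = "CCAAAAGG"   # toy reference containing a 4-bp homopolymer run
read = "CCAAAGG"   # toy read: the same sequence with one 'A' deleted

# Enumerate every single-base deletion of the reference that reproduces
# the read. Each position is a distinct, equally optimal placement of
# the same 1-bp INDEL; an aligner must pick one of them arbitrarily.
placements = [i for i in range(len(ref)) if ref[:i] + ref[i + 1:] == read]

print(placements)  # four equally optimal deletion positions: [2, 3, 4, 5]
```

Any of the four A's can be the "deleted" one, which is why different aligners (or different runs of the same aligner) can report the same variant at different coordinates.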
We examined several popular aligners and showed that they could be divided into groups whose alignments yielded INDELs that either strongly agreed or strongly disagreed with reported INDELs. This finding suggests that the agreement or disagreement between aligner-called INDELs and reported INDELs is merely a result of the arbitrary selection of one optimal alignment. Also of note is LongAGE, a memory-efficient version of Alignment with Gap Excision (AGE) for defining genomic variant breakpoints, which enables the precise alignment of longer reads or contigs that potentially contain SVs/CNVs, at the cost of increased running time relative to AGE.

Finally, consider several resource-intensive tasks in metagenomics. We introduce a new algorithmic method for detecting unknown bacteria, those whose genomes have not been sequenced, in microbial communities. Using the 16S ribosomal RNA (16S rRNA) gene instead of whole-genome information is not only computationally efficient but also economical; we provide an analysis demonstrating that the 16S rRNA gene retains sufficient information to detect unknown bacteria in the context of oral microbial communities. Furthermore, we revisit the hypothesis that the classification or identification of microbes in metagenomic samples is better done with long reads than with short reads, by investigating the performance of popular metagenomic classifiers on short reads and on longer sequences assembled from those short reads. Higher overall performance in species classification was achieved simply by assembling the short reads.

These topics, read-to-reference alignment, resequencing studies, and metagenomics, are the key focal points in the pages to come. My dissertation delves deeper into each as I cover the contributions my work has made to the field.
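As a concrete illustration of the repeat-based complexity measure from the first study above, the fraction of distinct substrings of a fixed length k can be sketched as follows (a minimal sketch on toy sequences; the function name and examples are illustrative, not the dissertation's implementation):

```python
def distinct_substring_fraction(genome: str, k: int) -> float:
    """Fraction of the genome's length-k windows that are distinct.

    Values near 1 indicate few repeats, so reads of length ~k are
    easier to place uniquely; low values indicate a repetitive genome.
    """
    windows = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    return len(set(windows)) / len(windows)

repetitive = "ACGT" * 10                  # highly repetitive toy "genome"
complex_seq = "ACGTTGCAAGCTTACGGATC"      # more complex toy sequence

print(distinct_substring_fraction(repetitive, 5))   # low: only 4 distinct 5-mers
print(distinct_substring_fraction(complex_seq, 5))  # high: all 5-mers distinct
```

The repetitive sequence scores low because its period-4 structure produces only four distinct 5-mers, mirroring how repeat-rich genomes make short-read placement ambiguous.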
Dissertation or thesis originally submitted to ProQuest
Tran, Quang, "Algorithmic methods for large-scale genomic and metagenomic data analysis" (2020). Electronic Theses and Dissertations. 2972.