Electronic Theses and Dissertations

Date

2024

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Computer Science

Committee Chair

Vinhthuy Pham

Committee Member

Bernie Daigle

Committee Member

Thomas Watson

Committee Member

Myounggyu Won

Abstract

DNA sequencing technologies have transformed genomics, allowing for the decoding of genetic information stored within an organism's DNA. Recent advances in Next-Generation Sequencing (NGS) technologies provide great opportunities to obtain, analyze, and understand the genetic information of many species. Using NGS, researchers can obtain millions of fragments of DNA, known as reads, from different environments. Metagenomics, the study of DNA from environmental samples, explores microbial taxonomy and function. Analyzing metagenomic data is an important problem, as recent studies show a relationship between microbial composition and certain diseases or disorders of human health. This task can be computationally expensive, since microbial communities usually consist of hundreds to thousands of environmental microbial species. In this dissertation, we introduce methods to profile and identify bacteria in metagenomic samples, two of the most fundamental tasks in metagenomic analysis. First, we introduce an efficient alignment-free approach to estimate the abundances of microbial genomes in metagenomic samples. The approach is based on solving linear and quadratic programs, which are represented by Genome-Specific Markers. We compared our method against popular alignment-free and alignment-based methods. Without contamination, our method was more accurate than other alignment-free methods while being much faster than an alignment-based method. In more realistic settings where samples were contaminated with human DNA, our method was the most accurate at predicting abundance across varying levels of contamination, achieving higher accuracy than both alignment-free and alignment-based methods. Second, we introduce a new method for representing bacteria in a microbial community using genomic signatures of those bacteria.
With respect to the microbial community, the genomic signatures of each bacterium are unique to that bacterium; they do not exist in other bacteria in the community. Further, since the genomic signatures of a bacterium are much smaller than its genome, the approach allows for a compressed representation of the microbial community. This approach uses a modified Bloom filter to store short k-mers with hash values that are unique to each bacterium. We show that most bacteria in many microbiomes can be represented uniquely using the proposed genomic signatures. This approach paves the way toward new methods for classifying bacteria in metagenomic samples. Finally, we introduce a new method designed to enhance species prediction in metagenomic environments. The method addresses the challenge of accurate species identification in complex microbiomes, which arises from the large number of generated reads and the ever-expanding number of bacterial genomes. This method utilizes a modified Bloom filter for efficient indexing of reference genomes and incorporates a novel strategy for reducing false positives by clustering species based on the coverage of their genomes by identified reads. Clustering based on approximate coverages significantly improved precision in species identification, effectively minimizing false positives. The method was evaluated and compared with several well-established tools across various datasets. We further demonstrated that other metagenomic tools can also benefit from our approach to removing false positives by clustering species based on approximate coverages, indicating its potential for broader application in the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of microbial databases.
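
The notion of a genomic signature described in the abstract (k-mers that occur in one bacterium and in no other member of the community) can be illustrated with a minimal sketch. This is not the dissertation's implementation: it uses exact Python sets rather than a modified Bloom filter, ignores reverse complements, and the function names and toy sequences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmer_signatures(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each genome name to the k-mers found in no other genome."""
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    # Count how many genomes contain each k-mer (not how often it occurs).
    genome_count = Counter(kmer for ks in per_genome.values() for kmer in ks)
    return {
        name: {kmer for kmer in ks if genome_count[kmer] == 1}
        for name, ks in per_genome.items()
    }


# Toy, made-up sequences for illustration only.
toy_community = {
    "species_a": "ACGTAC",
    "species_b": "CGTACG",
}
signatures = unique_kmer_signatures(toy_community, k=4)
```

In this toy community the two sequences share the 4-mers CGTA and GTAC, so only the remaining k-mer of each sequence serves as its signature. A Bloom-filter-based variant, as the abstract describes, would trade this exact set membership for a smaller, probabilistic representation.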

Comments

Data is provided by the student.

Library Comment

Dissertation or thesis originally submitted to ProQuest.

Notes

Embargoed until 07-17-2026

Available for download on Friday, July 17, 2026
