Electronic Theses and Dissertations
Date
2024
Document Type
Dissertation
Degree Name
Doctor of Philosophy
Department
Public Health
Committee Chair
Xichen Mou
Committee Member
Hongmei Zhang
Committee Member
Yu Jiang
Committee Member
Ching-Chi Yang
Abstract
High-dimensional data, such as DNA methylation data, are datasets in which the number of variables (features) far exceeds the number of samples. Our research uses advanced biostatistical methods to estimate and predict outcomes accurately from these complex datasets.
In the first project, we aim to identify differentially methylated regions (DMRs) within the human genome using a novel biostatistical method. These genomic regions or specific positions exhibit distinct methylation patterns across phenotypes. Despite existing methodologies such as epigenome-wide association studies (EWAS) and dmrff, low statistical power, high false-positive rates, and the difficulty of controlling for confounders continue to hinder progress in this field. To address these issues, we develop an approach based on the Generalized Beta distribution, which models DNA methylation data and accounts for correlation patterns through shared parameters. Motivated by the characteristics of DNA methylation, the method demonstrates significant power to identify potential biomarkers in both simulation studies and real-world data analyses.
In the second project, we aim to develop comprehensive prediction models for allergic diseases by integrating clinical variables with epigenetic risk factors identified through advanced feature selection methods. We focus on asthma, whose clinical manifestations vary across life stages. Our objective is to improve predictive accuracy through robust models that incorporate both clinical markers and epigenetic markers from DNA methylation (DNAm) profiles obtained at birth.
In the third project, we use machine learning algorithms to predict anxiety and depression in children from a very large dataset. The results show that random forest outperforms the alternative methods, with comparable accuracy and higher area under the curve (AUC) scores. Feature importance analysis identifies key interpretable predictors, including children’s demographics and parental mental health. The large number of participants in the test set supports the robustness of the model and the selected features. The study uses the largest dataset to date for predicting children’s mental health, and it relies on fewer features than previously reported methods. Overall, the study underscores the potential of certain predictors for mental health prediction and offers algorithmic insights for research and clinical applications.
Library Comment
Dissertation or thesis originally submitted to ProQuest.
Notes
Open Access
Recommended Citation
Wu, Chengzhou, "Biostatistical Methods in High-Dimensional Estimation and Prediction Problems" (2024). Electronic Theses and Dissertations. 3618.
https://digitalcommons.memphis.edu/etd/3618
Comments
Data is provided by the student.