Electronic Theses and Dissertations

Author

Chengzhou Wu

Date

2024

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Public Health

Committee Chair

Xichen Mou

Committee Member

Hongmei Zhang

Committee Member

Yu Jiang

Committee Member

Ching-Chi Yang

Abstract

High-dimensional data, such as DNA methylation data, are datasets in which the number of variables (features) far exceeds the number of samples. Our objective is to use advanced biostatistical methods to accurately estimate and predict outcomes from these complex datasets. In the first project, we aim to identify differentially methylated regions (DMRs) in the human genome using a novel biostatistical method. DMRs are genomic regions or specific positions that exhibit distinct methylation patterns across phenotypes. Although methods such as EWAS and dmrff exist, low statistical power, high false positive rates, and the difficulty of controlling for confounders continue to hinder progress in this field. To address these issues, we develop an approach based on the Generalized Beta distribution, which models DNA methylation data and accounts for correlation patterns through shared parameters. The approach is motivated by the unique characteristics of DNA methylation and shows substantial power to identify potential biomarkers in simulation studies and real-world data analyses. In the second project, we aim to develop comprehensive prediction models for allergic diseases by integrating clinical variables with epigenetic risk factors identified through advanced feature selection methods. Asthma, whose clinical manifestations vary across life stages, serves as our focal point. Our objective is to improve predictive accuracy with robust models that incorporate both clinical markers and epigenetic markers from DNA methylation (DNAm) profiles obtained at birth. In the third project, we explore the prediction of anxiety and depression using machine learning algorithms applied to a very large dataset. Results show that random forest outperforms the alternative methods, with comparable accuracy and a superior area under the curve (AUC). Feature importance analysis reveals key interpretable predictors, including children’s demographics and parental mental health. The large number of testing participants makes the model and the selected features robust for prediction. The study uses the largest dataset to date for predicting children’s mental health, and it relies on fewer features than previously reported methods. Overall, the study underscores the potential of these predictors for mental health prediction and offers algorithmic insights for research and clinical applications.

Comments

Data is provided by the student.

Library Comment

Dissertation or thesis originally submitted to ProQuest.

Notes

Open Access
