Electronic Theses and Dissertations

Date

2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Public Health

Committee Chair

Yu Jiang

Committee Member

Hongmei Zhang

Committee Member

Meredith Ray

Committee Member

Syed Hasan Arshad

Abstract

Binary outcomes with rare events (event rate less than 5%) present significant analytical challenges, particularly in high-dimensional settings such as epigenome-wide association studies (EWAS) involving DNA methylation (DNAm) data. Traditional logistic regression is often inadequate under these conditions, suffering from bias, high variance, and diminished power when the event rate is low and the number of predictors exceeds the sample size (p >> n). To address these limitations, we developed a series of novel methods that enhance sensitivity and accuracy in identifying biologically meaningful biomarkers while accounting for both event rarity and high dimensionality. We first introduce two screening approaches. The Rare-Screening method incorporates bootstrap resampling with empirical Bayes adjustments to stabilize inference, while the Firth-ttScreening method applies a Firth-corrected logistic regression within a cross-validation framework. We evaluated both methods through simulation studies and an application to the Isle of Wight (IOW) male birth cohort, examining the association between DNAm at birth and asthma acquisition during adolescence. The results show that the developed methods have much higher sensitivity than the benchmark methods. Rare-Screening identified 579 CpG sites and Firth-ttScreening identified 34 CpG sites from over 450,000 CpGs, with the 25 CpGs common to both taken as candidate biomarkers. Second, we introduce a novel penalty, FMCP, in the joint model to address the challenges posed by high dimensionality and rare events. FMCP integrates a log-F penalty for bias reduction in rare events and the minimax concave penalty (MCP) for sparsity. Simulations show that FMCP consistently achieved superior sensitivity and accuracy compared with conventional MCP and LASSO methods. Out of the 25 screened CpG sites, FMCP selected 10 CpG sites, whereas the traditional MCP method failed to work at the low event rate.
Finally, we propose a Bayesian hierarchical model that integrates prior biological knowledge through a log-F(1,1) prior for covariate correction and a spike-and-slab prior for variable selection. A Beta hierarchical structure enables adaptive weighting of prior inclusion probabilities. For the IOW dataset, this model identified 3 CpG sites as potential biomarkers for asthma transition. Overall, the methodological innovations in the current study provide a robust framework for analyzing rare binary outcomes in high-dimensional biological data, advancing the discovery of epigenetic biomarkers.

Comments

Data is provided by the student.

Library Comment

Dissertation or thesis originally submitted to ProQuest.

Notes

Embargoed until 02-18-2026

Available for download on Wednesday, February 18, 2026
