Date of Award
Doctor of Philosophy
Big data is commonly characterized by five V's: (1) "volume", a huge number of observations (large n) and/or a large number of variables (large p); (2) "variety", data of various types, natures, and formats; (3) "velocity", the ultra-high speed of data generation and collection; (4) "veracity", the trustworthiness and quality of the data; and (5) "value", the insights, usefulness, and impact the data provide. Traditional methodologies and techniques, running on current computational resources, struggle to keep up with the extraordinary volume of data being generated, which makes it challenging to extract useful information from big data. In this dissertation, we propose several strategies that address some of these issues for modern variable selection procedures. In particular, we evaluate four procedures: (1) random sub-sampling, so that the sub-data will be "similar" to the original big data; (2) random row partitions, so that "all data" will be included; (3) random column partitions, which reduce the dimension for "feasible" model building and/or variable selection while "all columns" can still be included; and (4) random matrix partitions, a natural extension that uses both "row partition" and "column partition". Results from each proposed procedure can be combined via ensemble methods.
In aging biomarker studies, methylation of cytosine residues of cytosine-phosphate-guanine dinucleotides (CpGs) shows strong associations with aging. Several epigenetic clocks based on CpG methylation have been proposed in the literature: the Hannum clock (2013) with 71 CpGs, the Horvath clock (2013) with 353 CpGs, the Levine clock (2015), and the GrimAge clock (2019) with 1,030 CpGs. We demonstrate that our proposed procedures can be useful in this research area for building a simpler but still useful model from ultra-high-dimensional data.
In our study, a total of 2,640 SJLIFE participants of European ancestry were included, consisting of 2,112 SJLIFE childhood cancer survivors as training data and a separate 528 cancer survivors as validation data. The data include 689,414 CpGs. This is a clear example of a large-p problem: the number of variables (p = 689,414) far exceeds the sample size n. We demonstrate that the proposed procedures can indeed be used to develop a new DNA methylation-based epigenetic clock with far fewer CpG sites.
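The four partitioning strategies named in the abstract can be illustrated with a minimal sketch that works on row and column indices only. This is not the dissertation's implementation: the function names, the block counts, and the majority-vote ensemble rule are all assumptions made for illustration, and the variable selection step that would be fitted on each block is left out.

```python
import random
from collections import Counter


def random_subsample(n_rows, size, rng):
    """Strategy 1: draw `size` row indices without replacement."""
    return sorted(rng.sample(range(n_rows), size))


def random_partition(n_items, n_blocks, rng):
    """Strategies 2 and 3: shuffle indices 0..n_items-1 and deal them
    into `n_blocks` disjoint blocks that together cover every index."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return [sorted(idx[b::n_blocks]) for b in range(n_blocks)]


def matrix_partition(n_rows, n_cols, row_blocks, col_blocks, rng):
    """Strategy 4: cross a row partition with a column partition,
    giving one (rows, cols) sub-matrix per pair of blocks."""
    rows = random_partition(n_rows, row_blocks, rng)
    cols = random_partition(n_cols, col_blocks, rng)
    return [(r, c) for r in rows for c in cols]


def vote(selections, min_fraction=0.5):
    """Hypothetical ensemble rule: keep a variable if it was selected
    in at least `min_fraction` of the fitted blocks."""
    counts = Counter(v for sel in selections for v in set(sel))
    cutoff = min_fraction * len(selections)
    return sorted(v for v, k in counts.items() if k >= cutoff)


rng = random.Random(0)                      # fixed seed for repeatability
sub_rows = random_subsample(10, 4, rng)     # 4 of 10 rows
row_blocks = random_partition(10, 3, rng)   # 10 rows in 3 blocks
col_blocks = random_partition(8, 2, rng)    # 8 columns in 2 blocks
cells = matrix_partition(10, 8, 3, 2, rng)  # 3 x 2 = 6 sub-matrices

# Pretend a selector chose these column indices on three row blocks.
chosen = vote([[1, 4], [1, 5], [1, 4, 7]])
print(chosen)  # [1, 4]
```

In practice each block of indices would be used to slice the data matrix, a variable selection method would be fitted on every slice, and the per-block selections would be passed to a combining rule such as `vote`.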
Dissertation or thesis originally submitted to ProQuest
Li, Zhenghong, "Variable Selection In Big Data With Applications To Develop A New Epigenetic Clock" (2021). Electronic Theses and Dissertations. 2642.