Date of Award
Doctor of Philosophy
Lih Yuan Deng
With the advent of new technologies, more data is being collected than ever before. Data is pouring in from every conceivable direction: from operational and transactional systems, from microarray experiments and genome-wide association studies, from inbound and outbound customer contact points, and from mobile media and the Web, to mention a few. Researchers and investigators in many fields face the problem of identifying important effects among thousands of variables in high-dimensional data sets, a process that often yields non- or weakly identified effects. A common problem when processing data sets with a large number of variables relative to the sample size is estimating the parameters associated with each variable. When the number of variables far exceeds the number of samples, parameter estimation becomes very difficult. Attempts to find the important variables driving a phenomenon through single-variable analysis are unlikely to give a comprehensive picture, because the phenomena are complex and several predictors may have potentially significant effects. Methods based on single-variable analysis are therefore too simple to give a comprehensive picture of phenotype architecture, and more statistically challenging models that can accommodate the simultaneous analysis of a large number of variables despite small sample sizes are essential in these cohorts.
In this thesis, we developed several novel methods for sample classification, prediction, and feature extraction in cohorts with a large number of variables relative to the sample size, using Bayesian shrinkage methods as well as non-parametric methods such as Support Vector Machines and Random Forests. We placed Generalized Double Pareto and Double Exponential prior distributions on the parameters in a Bayesian Generalized Linear Model setting. These distributions have a spike at zero that shrinks the parameters towards zero, which imposes sparsity on the model.
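As a rough illustration of why such priors encourage sparsity, the sketch below evaluates the two densities at a few coefficient values. The parameterizations are assumptions, since the abstract does not state them: the Double Exponential (Laplace) density uses a hypothetical `scale` parameter, and the Generalized Double Pareto density uses the commonly cited form with a scale `xi` and a shape `alpha`. Both peak at zero, while the Generalized Double Pareto keeps more mass in the tails, so large effects are shrunk less.

```python
import math


def double_exponential_pdf(beta, scale=1.0):
    """Double Exponential (Laplace) density centred at zero.

    Assumed form: f(beta) = exp(-|beta| / scale) / (2 * scale).
    """
    return math.exp(-abs(beta) / scale) / (2.0 * scale)


def generalized_double_pareto_pdf(beta, xi=1.0, alpha=1.0):
    """Generalized Double Pareto density centred at zero.

    Assumed form:
    f(beta) = (1 / (2 * xi)) * (1 + |beta| / (alpha * xi)) ** -(alpha + 1).
    """
    return (1.0 + abs(beta) / (alpha * xi)) ** (-(alpha + 1.0)) / (2.0 * xi)


# Compare the two densities at increasing distances from zero.
for beta in (0.0, 0.5, 1.0, 3.0, 10.0):
    de = double_exponential_pdf(beta)
    gdp = generalized_double_pareto_pdf(beta)
    print(f"beta={beta:5.1f}  double_exponential={de:.6f}  gdp={gdp:.6f}")
```

The parameter values here are illustrative defaults, not the hyperparameters used in the thesis.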
We used a Markov Chain Monte Carlo (MCMC) method based on the Gibbs sampling algorithm to estimate the parameters. The models were applied to microarray data sets such as prostate cancer, leukemia, and breast cancer cohorts. To obtain more robust results, the data were resampled into train and test sets 50 times, and the average performance of the models over the 50 runs was reported. We investigated the classification accuracy, feature extraction ability, and prediction ability of the models. Based on our findings, the Bayesian hierarchical models we developed achieve high classification accuracy and yield more cohesive variable sets than other common methods used for the same purpose. We show that, using only a few predictors obtained from our models, we achieve higher performance than other competitive methods. We also investigated the use of literature to aid the selection of the initial predictors used in the model. Our findings suggest that although the use of literature improves prediction and classification in some instances, this is not universally true, and in some cases it results in poorer performance. This is mainly because literature-based predictor sets can be weak signals in the data set at hand, and because literature-based knowledge of the variables driving different phenomena is incomplete. Ideally, we would like to use literature to tune and prioritize signals coming directly from the experiment. To this end, we developed a literature-aided sparse Bayesian Generalized Linear Model that uses literature information a priori to guide the choice of hyperparameters and the amount of shrinkage imposed in the model. The developed model not only achieves high classification accuracy, sensitivity, and specificity but also results in substantially more relevant gene sets, which turn out to explain the underlying mechanisms of the phenotypes better.
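The repeated train/test resampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the data are synthetic, the classifier is a simple nearest-centroid rule standing in for the Bayesian models, and the names `make_toy_data`, `resampled_accuracy`, and the 30% test fraction are hypothetical. Only the idea of averaging performance over 50 random splits comes from the abstract.

```python
import random
import statistics


def make_toy_data(n_samples=40, n_features=100, n_informative=10, seed=0):
    """Synthetic two-class data with many more features than samples."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 3.0 * label  # shift the informative features for class 1
        X.append(row)
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Per-class feature means (stand-in classifier, not the thesis model)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def resampled_accuracy(X, y, n_runs=50, test_fraction=0.3, seed=1):
    """Test accuracy for each of n_runs random train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, int(round(test_fraction * n)))
    accuracies = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        centroids = fit_centroids([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        correct = sum(predict(centroids, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / n_test)
    return accuracies


X, y = make_toy_data()
accs = resampled_accuracy(X, y)
print(f"mean accuracy over {len(accs)} runs: {statistics.mean(accs):.3f}")
print(f"standard deviation across runs: {statistics.stdev(accs):.3f}")
```

Reporting the spread across runs alongside the mean gives a sense of how sensitive the result is to the particular split, which matters when sample sizes are small.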
Dissertation or thesis originally submitted to the local University of Memphis Electronic Theses & Dissertation (ETD) Repository.
Madahian, Behrouz, "Statistical Shrinkage Methods for Classification, Prediction, and Feature Extraction Using Genomewide Gene Expression Data and Small Sample Sizes" (2015). Electronic Theses and Dissertations. 1204.