Electronic Theses and Dissertations


Xianqiang Fu



Document Type


Degree Name

Master of Science


Public Health

Committee Chair

Yu Jiang

Committee Member

Hongmei Zhang

Committee Member

Chunrong Jia


Abstract

As environmental data grow in complexity, machine learning offers a way to extract meaningful insights from them. This study investigated the applicability and performance of various machine learning methods for multi-class classification problems, with a specific focus on complex environmental data, including polycyclic aromatic hydrocarbons (PAHs). Using simulation studies, we evaluated ten machine learning models on multivariate classification problems. Regularized Multinomial Logistic Regression (RMLR) achieved higher classification accuracy when the predictors were mutually independent, whereas the Gradient Boosting Machine (GBM) outperformed the other models when the predictors were highly correlated. We also evaluated the feature selection accuracy of three methods. GBM and Random Forest (RF) showed higher sensitivity than the other methods across the different data settings. These findings suggest that linear models such as RMLR and Multinomial Logistic Regression (MLR) may not perform optimally when the predictors are highly correlated, and that tree-based methods such as GBM and RF are a better choice in that setting. Overall, it is crucial to choose a machine learning method that matches the complexity of the environmental data and the specific requirements of the task.
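
The abstract reports feature selection "sensitivity" without stating how it was computed. As a minimal sketch, assuming the conventional definition (the share of truly relevant predictors that a method selects), it could be scored in a simulation replicate as below. The function name `selection_metrics`, its parameters, and the example indices are hypothetical and are not taken from the thesis.

```python
def selection_metrics(selected, relevant, n_features):
    """Return (sensitivity, specificity) of a selected feature set.

    Assumed definitions (not confirmed by the thesis):
      sensitivity = relevant features selected / all relevant features
      specificity = irrelevant features not selected / all irrelevant features
    Features are identified by integer index in range(n_features).
    """
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(n_features)) - relevant
    true_pos = len(selected & relevant)
    true_neg = len(irrelevant - selected)
    sensitivity = true_pos / len(relevant) if relevant else float("nan")
    specificity = true_neg / len(irrelevant) if irrelevant else float("nan")
    return sensitivity, specificity


# Hypothetical replicate: 10 predictors, the first 4 truly relevant,
# and a method that selects features 0, 1, 2 and 7.
sens, spec = selection_metrics(
    selected=[0, 1, 2, 7], relevant=[0, 1, 2, 3], n_features=10
)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because the relevant predictors are known only in simulated data, a score of this kind fits the simulation studies the abstract describes and would not be available for observed PAH measurements.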


Data is provided by the student.

Library Comment

Dissertation or thesis originally submitted to ProQuest


Open Access