Date of Award
Master of Science
Abstract
As environmental data grows in complexity, machine learning offers a way to extract meaningful insights from such data. This study investigated the applicability and performance of various machine learning methods for multi-class classification problems, with a specific focus on complex environmental data, including polycyclic aromatic hydrocarbons (PAHs). We evaluated ten machine learning models in simulation studies to assess their performance on multivariate classification problems. Regularized Multinomial Logistic Regression (RMLR) achieved higher classification accuracy when the predictors were mutually independent, while the Gradient Boosting Machine (GBM) outperformed the other models when the predictors were highly correlated. The feature selection accuracy of three methods was also evaluated: GBM and Random Forest (RF) showed higher sensitivity than the other methods across different data settings. These findings suggest that linear models such as RMLR and multinomial logistic regression (MLR) may not perform optimally when the predictors are highly correlated, and that tree-based methods such as GBM and RF are a better choice in that setting. Overall, it is crucial to choose a machine learning method that suits the complexity of the environmental data and the specific requirements of the task.
Dissertation or thesis originally submitted to ProQuest
Fu, Xianqiang, "Evaluation of Machine Learning Methods for Multivariate Classification with Application to Environmental Datasets" (2023). Electronic Theses and Dissertations. 3012.