Electronic Theses and Dissertations

Date

2025

Document Type

Thesis

Degree Name

Master of Science

Department

Public Health

Committee Chair

Meredith Ray

Committee Member

Ching-Chi Yang

Committee Member

Hongmei Zhang

Committee Member

Yu Jiang

Abstract

Clustering analysis is a fundamental technique in machine learning, with K-means and its variants being widely used for their interpretability and efficiency. The Vector in Partition (VIP) algorithm extends K-means by incorporating a multi-dimensional distance measure for nested genetic data structures, but it inherits the challenge of selecting the optimal number of clusters (k). This thesis proposes integrating the simplified silhouette score (SSI) into VIP’s optimization options to improve k selection. Through simulation studies comparing SSI with the currently implemented methods (Elbow, Slope, and Minimum AIC and BIC), we demonstrate that SSI consistently performs on par with or better than existing methods, particularly in datasets with distinct clusters. Across tested settings, SSI appears resilient to changes in the number of subjects, although performance is slightly reduced at larger numbers of genes. While performance declines in non-distinct datasets, SSI remains a useful heuristic for assessing clustering solutions, with minimal additional computational cost.
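
As a rough illustration of the score named in the abstract, the sketch below computes a simplified silhouette in its commonly described form: for each point, a is the distance to its own cluster centroid, b is the distance to the nearest other centroid, and the point's score is (b - a) / max(a, b), averaged over all points. This is a minimal sketch and not the VIP implementation. Plain Euclidean distance is assumed here in place of VIP's multi-dimensional distance measure, and the function name and toy data are hypothetical.

```python
import math


def simplified_silhouette(points, labels, centroids):
    """Mean simplified silhouette score over all points.

    Assumption: Euclidean distance stands in for whatever distance the
    clustering method actually uses (VIP uses its own measure).
    """
    if len(centroids) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for point, label in zip(points, labels):
        # a: distance to the centroid of the point's own cluster
        a = math.dist(point, centroids[label])
        # b: distance to the nearest centroid of any other cluster
        b = min(
            math.dist(point, centroid)
            for j, centroid in enumerate(centroids)
            if j != label
        )
        denom = max(a, b)
        # A point sitting exactly on two coincident centroids scores 0
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two well-separated groups of two points each
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centroids = [(0.0, 0.5), (5.0, 5.5)]

good_score = simplified_silhouette(points, [0, 0, 1, 1], centroids)
bad_score = simplified_silhouette(points, [1, 1, 0, 0], centroids)
```

Under this formulation, a candidate k would be chosen by running the clustering for each k and keeping the k with the highest score. Because only point-to-centroid distances are needed, the cost per k is roughly one extra pass over the data, which is consistent with the abstract's remark about minimal additional computational cost.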

Comments

Data is provided by the student.

Library Comment

Dissertation or thesis originally submitted to ProQuest.

Notes

Open Access
