An information-theoretic approach to dimensionality reduction in data science


Data reduction is crucial for turning large datasets into information, the major purpose of data science. The classic and rich area of dimensionality reduction (DR) has traditionally relied on feature extraction, combining primary features linearly while aiming to preserve covariance/correlation structure among the features. Nonlinear alternatives have been developed, including information-theoretic approaches based on mutual information and on conditional entropy with respect to target features. Here, we extend this approach to a feature selection/reduction strategy based on the conditional Shannon entropy of two random variables. Novel results include (a) a dimensionality reduction method, in two variants, based on conditional entropy between the predictors themselves, disregarding the influence of the target feature; (b) an error-prevention method for DR, inspired by error detection and correction in information theory, developed for genomic data but applicable to abiotic data as well; and (c) a comparative assessment of the performance of several machine learning models on input features selected by these methods. We assess the quality of the techniques by their performance on three application problems of varying difficulty (Malware Classification, BioTaxonomy, and Noisy Classification), with competitive outcomes. The analysis of the results yields some useful heuristics and suggests several problems of interest for further research.
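The core quantity behind the strategy described above can be illustrated with a minimal sketch. The following is not the paper's method, only an assumed greedy variant for discrete features: it uses the identity H(X|Y) = H(X,Y) - H(Y) and repeatedly keeps the predictor least predictable (highest conditional entropy) given the predictors already selected, ignoring the target, as in point (a). The function names (`entropy`, `conditional_entropy`, `select_features`) are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(seq):
    """Shannon entropy (in bits) of an empirical discrete distribution."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

def conditional_entropy(x, y):
    """H(X|Y) = H(X, Y) - H(Y), estimated from paired samples."""
    return entropy(list(zip(x, y))) - entropy(y)

def select_features(columns, k):
    """Greedy target-free selection: start from the highest-entropy
    feature, then repeatedly add the feature that is least predictable
    from (i.e., has highest conditional entropy given) those chosen."""
    names = list(columns)
    selected = [max(names, key=lambda f: entropy(columns[f]))]
    while len(selected) < k:
        rest = [f for f in names if f not in selected]
        joint = list(zip(*(columns[s] for s in selected)))
        selected.append(
            max(rest, key=lambda f: conditional_entropy(columns[f], joint))
        )
    return selected
```

For example, given columns `a` and `b` that duplicate each other and an independent column `c`, selecting two features keeps `a` and `c`, since H(b|a) = 0 makes `b` redundant.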

Publication Title

International Journal of Data Science and Analytics