ActivePCA: A Novel Framework Integrating PCA and Active Machine Learning for Efficient Dimension Reduction

Bibliographic Details
Published in 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 320-325
Main Authors Bhyregowda, Priyanka; Masum, Mohammad; Mamudu, Lohuwa; Chowdhury, Mohammed; Kosaraju, Sai Chandra; Shahriar, Hossain
Format Conference Proceeding
Language English
Published IEEE 02.07.2024
Summary: In medical data analysis, high-dimensional datasets pose challenges for computational complexity, resource utilization, and model interpretability. Principal Component Analysis (PCA), a widely used dimension reduction technique, addresses these challenges by transforming high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. However, PCA has limitations in high-dimensional contexts: it can lose information, and because it uses the entire dataset in the transformation, its computational demands grow with dataset size. In this paper, we propose ActivePCA, a novel framework that integrates PCA and Active Machine Learning (AML) so that only a subset of the dataset is used in the dimension reduction process. In the first step, the framework selects the most informative instances from the dataset. In the second step, ActivePCA applies PCA to the selected subset only. To demonstrate its effectiveness, we applied the proposed framework to six electronic health record (EHR) datasets of varying dimensions. The framework significantly reduces both the number of observations (through AML) and the number of dimensions (through PCA), resulting in improved performance from ML classifiers. ActivePCA reduces labeling cost by approximately 50% to 80% on the EHR datasets relative to the original datasets. In addition, ActivePCA achieves significantly higher accuracy using the reduced dimensions, showing the effectiveness of combining AML with PCA.
ISSN:2836-3795
DOI:10.1109/COMPSAC61105.2024.00052
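
The two steps described in the summary (select informative instances, then fit PCA on that subset only) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not state which query strategy or classifier ActivePCA uses, so the margin-based uncertainty sampling with a nearest-centroid model below is an assumption, as are the function names `select_informative`, `fit_pca`, and `transform`, the parameter values, and the synthetic data.

```python
import numpy as np


def select_informative(X, y, n_seed_per_class=5, n_queries=40, seed=0):
    """Step 1 (assumed strategy): pool-based uncertainty sampling.

    A nearest-centroid model is refit on the labeled set each round, and the
    pool instance with the smallest margin between its two closest class
    centroids is queried next. Requires at least two classes.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    labeled = []
    # Seed the labeled set with a few random instances from every class.
    for c in classes:
        idx = np.flatnonzero(y == c)
        k = min(n_seed_per_class, len(idx))
        labeled.extend(rng.choice(idx, size=k, replace=False).tolist())
    pool = sorted(set(range(len(X))) - set(labeled))

    for _ in range(min(n_queries, len(pool))):
        X_l, y_l = X[labeled], y[labeled]
        centroids = np.stack([X_l[y_l == c].mean(axis=0) for c in classes])
        # Distance from every pool instance to every class centroid.
        d = np.linalg.norm(X[pool][:, None, :] - centroids[None, :, :], axis=2)
        d.sort(axis=1)
        margin = d[:, 1] - d[:, 0]  # small margin = most uncertain
        labeled.append(pool.pop(int(np.argmin(margin))))
    return np.array(labeled)


def fit_pca(X_subset, n_components):
    """Step 2: fit PCA on the selected subset only, via SVD."""
    mean = X_subset.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_subset - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform(X, mean, components):
    """Project any data onto the components learned from the subset."""
    return (X - mean) @ components.T


# Synthetic stand-in for a tabular dataset (hypothetical, not EHR data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (150, 20)), rng.normal(1.0, 1.0, (150, 20))])
y = np.repeat([0, 1], 150)

selected = select_informative(X, y)            # indices of queried instances
mean, components = fit_pca(X[selected], n_components=3)
Z = transform(X, mean, components)             # reduced representation of all rows
print(selected.shape, Z.shape)
```

In this sketch only the rows in `selected` need labels and only those rows enter the SVD, which is the sense in which a subset-based approach can lower both labeling and computation cost. The fraction saved here is a consequence of the chosen `n_seed_per_class` and `n_queries`, not a reproduction of the reported results.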