TY - JOUR
T1 - Addressing Class Imbalance in Healthcare Data
T2 - Machine Learning Solutions for Age-Related Macular Degeneration and Preeclampsia
AU - Martinez-Velasco, Antonieta
AU - Martinez-Villasenor, Lourdes
AU - Miralles-Pechuan, Luis
N1 - Publisher Copyright:
© 2003-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - The use of machine learning in healthcare has transformed the way diseases are diagnosed and treatments are optimized. However, medical databases often lack balanced data due to challenges in data collection caused by privacy regulations. Certain health conditions are underrepresented, which hampers machine learning performance. To address this problem, a hybrid approach has been proposed that combines the Synthetic Minority Oversampling Technique (SMOTE) with undersampling and uses two specific techniques tailored for imbalanced datasets. Comparative evaluations were conducted using various thresholds to reduce one class and employing Balanced Accuracy to mitigate bias toward the majority class, with popular machine learning methods. The results showed that Balanced Bagging and Balanced Random Forest consistently outperformed other methods, performing the best with an average ranking of 1.42 and 3.58 out of 32 configurations in the two datasets, respectively. Tree-based approaches such as Random Forest and Gradient Boosting demonstrated similar effectiveness, emphasizing the power of aggregating predictions from multiple trees to reduce bias. Notably, undersampling and SMOTE proved advantageous for non-tree-based models like KNN, SVM, and Logistic Regression, showcasing their usefulness across different algorithms. This study provides a robust solution for handling imbalanced datasets in healthcare, which could potentially optimize healthcare interventions and improve patient outcomes and care.
AB - The use of machine learning in healthcare has transformed the way diseases are diagnosed and treatments are optimized. However, medical databases often lack balanced data due to challenges in data collection caused by privacy regulations. Certain health conditions are underrepresented, which hampers machine learning performance. To address this problem, a hybrid approach has been proposed that combines the Synthetic Minority Oversampling Technique (SMOTE) with undersampling and uses two specific techniques tailored for imbalanced datasets. Comparative evaluations were conducted using various thresholds to reduce one class and employing Balanced Accuracy to mitigate bias toward the majority class, with popular machine learning methods. The results showed that Balanced Bagging and Balanced Random Forest consistently outperformed other methods, performing the best with an average ranking of 1.42 and 3.58 out of 32 configurations in the two datasets, respectively. Tree-based approaches such as Random Forest and Gradient Boosting demonstrated similar effectiveness, emphasizing the power of aggregating predictions from multiple trees to reduce bias. Notably, undersampling and SMOTE proved advantageous for non-tree-based models like KNN, SVM, and Logistic Regression, showcasing their usefulness across different algorithms. This study provides a robust solution for handling imbalanced datasets in healthcare, which could potentially optimize healthcare interventions and improve patient outcomes and care.
KW - Class Imbalance
KW - Diagnostic Decision-Making
KW - Ensemble Classifiers
KW - Healthcare Domain
KW - Machine Learning Techniques
KW - Personalized Medicine
UR - https://www.scopus.com/pages/publications/85206477096
U2 - 10.1109/TLA.2024.10705995
DO - 10.1109/TLA.2024.10705995
M3 - Article
AN - SCOPUS:85206477096
SN - 1548-0992
VL - 22
SP - 806
EP - 820
JO - IEEE Latin America Transactions
JF - IEEE Latin America Transactions
IS - 10
ER -