TY - JOUR
T1 - Multivariate Feature Ranking With High-Dimensional Data for Classification Tasks
AU - Jimenez, Fernando
AU - Sanchez, Gracia
AU - Palma, Jose
AU - Miralles-Pechuan, Luis
AU - Botia, Juan A.
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2022
Y1 - 2022
N2 - In many machine learning classification problems, datasets are high-dimensional and therefore require efficient and effective methods for identifying the relative importance of their attributes and eliminating redundant and irrelevant ones. Because the search space of possible solutions is huge, feature selection methods based on attribute subset evaluation are poorly suited to these scenarios, so feature ranking methods are used instead. Most feature ranking methods described in the literature are univariate and therefore do not detect interactions between factors. In this paper, we propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency, which have been applied to cancer gene expression and genotype-tissue expression classification tasks using public datasets. We show statistically that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as other feature selection methods for attribute subset evaluation based on correlation and consistency with the multi-objective evolutionary search strategy, and the embedded feature selection methods C4.5 and LASSO. The proposed methods have been implemented on the WEKA platform for public use, making all the results reported in this paper repeatable and replicable.
KW - classification
KW - consistency
KW - correlation
KW - feature ranking
KW - feature selection
KW - high-dimensional data
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85131740381&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2022.3180773
DO - 10.1109/ACCESS.2022.3180773
M3 - Article
AN - SCOPUS:85131740381
SN - 2169-3536
VL - 10
SP - 60421
EP - 60437
JO - IEEE Access
JF - IEEE Access
ER -