TY - GEN
T1 - Multilabel Text Classification of Unbalanced Datasets
T2 - 19th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2018
AU - Skitalinskaya, Gabriella
AU - Cardiff, John
N1 - Publisher Copyright: © 2023, Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
AB - The natural distribution of textual data used in text classification is often imbalanced. Categories with fewer examples are under-represented, and classifiers for them, trained on datasets transformed to bag-of-words representations or basic topic-model representations, often perform far below a satisfactory level. We tackle this problem using a two-pass non-negative matrix factorization algorithm. This approach finds topics for each category independently, which allows topics for under-represented categories to be defined better. The results are analyzed from multiple perspectives (H-loss, accuracy, F-measure, precision, and recall) at the micro, macro, and example-based levels, since each is appropriate in different situations. Experimental validation shows that the two-pass matrix factorization improves on the classification results achieved using bag-of-words representations.
KW - Matrix decomposition
KW - Multi-label text classification
KW - Topic modeling
UR - https://www.scopus.com/pages/publications/85149980081
U2 - 10.1007/978-3-031-23804-8_22
DO - 10.1007/978-3-031-23804-8_22
M3 - Conference contribution
AN - SCOPUS:85149980081
SN - 9783031238031
T3 - Lecture Notes in Computer Science
SP - 275
EP - 286
BT - Computational Linguistics and Intelligent Text Processing - 19th International Conference, CICLing 2018, Revised Selected Papers
A2 - Gelbukh, Alexander
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 18 March 2018 through 24 March 2018
ER -