TY - GEN
T1 - Validation of tagging suggestion models for a hotel ticketing corpus
AU - Božić, Bojan
AU - Ríos, André
AU - Delany, Sarah Jane
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/11/19
Y1 - 2018/11/19
AB - This paper investigates methods for the prediction of tags on a textual corpus that describes hotel staff inputs in a ticketing system. The aim is to improve the tagging process and find the most suitable method for suggesting tags for a new text entry. The paper consists of two parts: (i) exploration of existing sample data, which includes statistical analysis and visualisation of the data to provide an overview, and (ii) evaluation of tag prediction approaches. We have included different approaches from different research fields in order to cover a broad spectrum of possible solutions. As a result, we have tested a machine learning model for multi-label classification (using gradient boosting), a statistical approach (using frequency heuristics), and two simple similarity-based classification approaches (Nearest Centroid and k-Nearest Neighbours). The experiment which compares the approaches uses recall to measure the quality of results. Finally, we provide a recommendation of the modelling approach which produces the best accuracy in terms of tag prediction on the sample data.
KW - K-Nearest Neighbour
KW - Multi-label Classification
KW - Natural Language Processing
KW - Tag Prediction
UR - http://www.scopus.com/inward/record.url?scp=85061118807&partnerID=8YFLogxK
U2 - 10.1145/3282373.3282386
DO - 10.1145/3282373.3282386
M3 - Conference contribution
AN - SCOPUS:85061118807
T3 - ACM International Conference Proceeding Series
SP - 15
EP - 23
BT - 20th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2018 - Proceedings
A2 - Anderst-Kotsis, Gabriele
A2 - Pardede, Eric
A2 - Steinbauer, Matthias
A2 - Indrawan-Santiago, Maria
A2 - Salvadori, Ivan Luiz
A2 - Khalil, Ismail
PB - Association for Computing Machinery
T2 - 20th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2018
Y2 - 19 November 2018 through 21 November 2018
ER -