TY - GEN
T1 - Presenting a Labelled Dataset for Real-Time Detection of Abusive User Posts
AU - Chen, Hao
AU - McKeever, Susan
AU - Delany, Sarah Jane
N1 - Publisher Copyright:
© 2017 ACM.
PY - 2017/08/23
Y1 - 2017/08/23
AB - Social media sites facilitate users in posting their own personal comments online. Most support free-format user posting, with close to real-time publishing speeds. However, online posts generated by a public user audience carry the risk of containing inappropriate, potentially abusive content. To detect such content, the straightforward approach is to filter against blacklists of profane terms. However, this lexicon-filtering approach is prone to problems with word variations and lack of context. Although recent methods inspired by machine learning have boosted detection accuracies, the lack of gold-standard labelled datasets limits the development of this approach. In this work, we present a dataset of user comments, using crowdsourcing for labelling. Since abusive content can be ambiguous and subjective to the individual reader, we propose an aggregation mechanism for assessing differing opinions from different labellers. In addition, instead of the typical binary categories of abusive or non-abusive, we introduce a third class of 'undecided' to capture the real-life scenario of instances that are neither blatantly abusive nor clearly harmless. We have performed preliminary experiments on this dataset using best-practice techniques in text classification. Finally, we have evaluated the detection performance of various feature groups, namely syntactic, semantic and context-based features. Results show that these features can improve our classifier performance by 18% in the detection of abusive content.
KW - Abusive detection
KW - Feature selection
KW - Labelling strategy
KW - Machine learning
UR - https://www.scopus.com/pages/publications/85030978087
U2 - 10.1145/3106426.3106456
DO - 10.1145/3106426.3106456
M3 - Conference contribution
AN - SCOPUS:85030978087
T3 - Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017
SP - 884
EP - 890
BT - Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017
PB - Association for Computing Machinery (ACM)
T2 - 16th IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017
Y2 - 23 August 2017 through 26 August 2017
ER -