TY - GEN
T1 - Harnessing the power of text mining for the detection of abusive content in social media
AU - Chen, Hao
AU - McKeever, Susan
AU - Delany, Sarah Jane
N1 - Publisher Copyright: © Springer International Publishing AG 2017.
PY - 2017
Y1 - 2017
N2 - The issues of cyberbullying and online harassment have gained considerable coverage in recent years. Social media providers need to be able to detect abusive content both accurately and efficiently in order to protect their users. Our aim is to investigate the application of core text mining techniques for the automatic detection of abusive content across a range of social media sources, including blogs, forums, media-sharing, Q&A and chat, using datasets from Twitter, YouTube, MySpace, Kongregate, Formspring and Slashdot. Using supervised machine learning, we compare alternative text representations and dimension reduction approaches, including feature selection and feature enhancement, demonstrating the impact of these techniques on detection accuracies. In addition, we investigate the need for sampling on imbalanced datasets. Our conclusions are: (1) Dataset balancing boosts accuracies significantly for social media abusive content detection; (2) Feature reduction, important for large feature sets that are typical of social media datasets, improves efficiency whilst maintaining detection accuracies; (3) The use of generic structural features common across all our datasets proved to be of limited use in the automatic detection of abusive content. Our findings can support practitioners in selecting appropriate text mining strategies in this area.
UR - http://www.scopus.com/inward/record.url?scp=84988646719&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-46562-3_12
DO - 10.1007/978-3-319-46562-3_12
M3 - Conference contribution
AN - SCOPUS:84988646719
SN - 9783319465616
T3 - Advances in Intelligent Systems and Computing
SP - 187
EP - 205
BT - Advances in Computational Intelligence Systems - Contributions Presented at the 16th UK Workshop on Computational Intelligence, 2016
A2 - Gegov, Alexander
A2 - Jayne, Chrisina
A2 - Shen, Qiang
A2 - Angelov, Plamen
PB - Springer Verlag
T2 - 16th UK Workshop on Computational Intelligence, UKCI 2016
Y2 - 7 September 2016 through 9 September 2016
ER -