Presenting a Labelled Dataset for Real-Time Detection of Abusive User Posts

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Social media sites let users post their own comments online. Most support free-format user posting, with close to real-time publishing speeds. However, online posts generated by a public user audience carry the risk of containing inappropriate, potentially abusive content. To detect such content, the straightforward approach is to filter against blacklists of profane terms. However, this lexicon filtering approach is prone to problems around word variations and lack of context. Although recent methods inspired by machine learning have boosted detection accuracies, the lack of gold-standard labelled datasets limits the development of this approach. In this work, we present a dataset of user comments, using crowdsourcing for labelling. Since abusive content can be ambiguous and subjective to the individual reader, we propose an aggregated mechanism for assessing different opinions from different labellers. In addition, instead of the typical binary categories of abusive or not, we introduce a third class of 'undecided' to capture the real-life scenario of instances that are neither blatantly abusive nor clearly harmless. We have performed preliminary experiments on this dataset using best-practice techniques in text classification. Finally, we have evaluated the detection performance of various feature groups, namely syntactic, semantic and context-based features. Results show these features can increase our classifier performance by 18% in detection of abusive content.
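
The abstract does not spell out how the opinions of different labellers are aggregated, so the following is only a minimal sketch of one possible scheme, not the authors' method. It assumes each labeller casts a binary vote and that a comment receives a clear class only when a sufficient share of labellers agree; the function name `aggregate_labels`, the label strings, and the `agreement` threshold of 0.7 are all hypothetical.

```python
from collections import Counter


def aggregate_labels(votes, agreement=0.7):
    """Collapse per-labeller votes into one of three classes.

    votes: list of "abusive" / "not_abusive" strings, one per labeller.
    agreement: assumed share of votes needed to assign a clear class
               (hypothetical value, not taken from the paper).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # Most frequent label and the number of labellers who chose it.
    label, top = counts.most_common(1)[0]
    if top / len(votes) >= agreement:
        return label
    # Labellers disagree too much: fall back to the third class.
    return "undecided"


if __name__ == "__main__":
    print(aggregate_labels(["abusive"] * 4 + ["not_abusive"]))
    print(aggregate_labels(["abusive", "not_abusive", "abusive", "not_abusive"]))
```

Under these assumptions, a comment with split votes falls into the 'undecided' class rather than being forced into either binary category.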

Original language: English
Title of host publication: Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017
Publisher: Association for Computing Machinery (ACM)
Pages: 884-890
Number of pages: 7
ISBN (Electronic): 9781450349512
Publication status: Published - 23 Aug 2017
Event: 16th IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017 - Leipzig, Germany
Duration: 23 Aug 2017 to 26 Aug 2017

Publication series

Name: Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017

Conference

Conference: 16th IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017
Country/Territory: Germany
City: Leipzig
Period: 23/08/17 to 26/08/17

Keywords

  • Abusive detection
  • Feature selection
  • Labelling strategy
  • Machine learning
