Classifying the content of online notepad services using active learning

Mhd Wesam Al-Nabki, Eduardo Fidalgo, Enrique Alegre, Sarah Jane Delany, Francisco Jáñez-Martino

Research output: Contribution to journal › Article › peer-review

Abstract

Pastebin is an online notepad service for sharing text anonymously. However, it can be misused to facilitate suspicious or even illegal activities, such as leaking sensitive information or sharing hyperlinks to child sexual abuse material. Because of the high volume of pastes uploaded daily, manual inspection of this material is not feasible. An automatic classifier, by contrast, could identify such activities with little or no human intervention. However, a supervised model may require a significant number of training samples and must handle the distinct text typologies present on Pastebin. This paper presents a classification approach composed of three cascading supervised classifiers that use Active Learning to select and label the most informative samples from Pastebin. The modularity of the proposed design allows each classifier to adapt to a specific text typology. The first classifier determines whether the text is a code snippet, and the second identifies whether it is readable. The third classification level is twofold: (i) a binary classifier that decides whether the text is suspicious and (ii) a multiclass classifier with seven predefined categories of possibly illegal activities. The average class recall of the binary and multiclass classifiers is 95.24% and 80.33%, respectively. Additionally, this paper presents a dataset of 3.8 million Pastebin samples, called onlIne Notepad Services PastEbin aCtiviTies (INSPECT-3.8M), along with the labels assigned by our classification framework. Our classifier flagged 7.54% of the collected samples as related to presumably criminal activities. Law enforcement agencies may benefit from the insights shared in our research when investigating Pastebin or other Online Notepad Services, or when automating their monitoring. This would allow the responsible authorities to block illegal content before it spreads to the public.
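The cascade and the sample-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0.5 threshold, and the scorer interface are hypothetical. The sketch assumes that code snippets and unreadable pastes leave the cascade before the third level, and it uses uncertainty sampling as the Active Learning query strategy; the abstract confirms neither choice.

```python
from typing import Callable, List, Sequence

# A scorer maps a paste to a probability in [0, 1]. These are hypothetical
# placeholders for trained models, not the classifiers from the paper.
Scorer = Callable[[str], float]


def route_paste(text: str, is_code: Scorer, is_readable: Scorer,
                is_suspicious: Scorer, threshold: float = 0.5) -> str:
    """Pass one paste through three cascading binary decisions.

    Assumption: code snippets and unreadable text exit the cascade early,
    so only readable, non-code text reaches the suspicious-content level.
    """
    if is_code(text) >= threshold:
        return "code"
    if is_readable(text) < threshold:
        return "unreadable"
    if is_suspicious(text) >= threshold:
        return "suspicious"
    return "not suspicious"


def most_uncertain(probabilities: Sequence[float], k: int) -> List[int]:
    """Uncertainty sampling for a binary classifier.

    Returns the indices of the k samples whose predicted probability is
    closest to 0.5, i.e. the ones the model is least sure about. These
    would be sent to a human annotator in the next labelling round.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]
```

In such a setup, each level would be trained separately: the model scores the unlabelled pool, `most_uncertain` picks the pastes to annotate, and the classifier is retrained on the enlarged labelled set.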

Original language: English
Journal: Journal of Intelligent Information Systems
Publication status: Published - 17 Oct 2024

Keywords

  • Active learning
  • Machine learning
  • Pastebin
  • Suspicious activities
