TY - JOUR
T1 - Classifying the content of online notepad services using active learning
AU - Al-Nabki, Mhd Wesam
AU - Fidalgo, Eduardo
AU - Alegre, Enrique
AU - Delany, Sarah Jane
AU - Jáñez-Martino, Francisco
N1 - Publisher Copyright: © The Author(s) 2024.
PY - 2024/10/17
Y1 - 2024/10/17
AB - Pastebin is an online notepad service for sharing text anonymously. However, it could be misused to propagate suspicious or even illegal activities, such as leaking sensitive information or sharing hyperlinks to child sexual abuse material. Due to the high volume of pastes uploaded daily, manual inspection of this material is not feasible. In contrast, an automatic classifier could identify such activities with little or no human intervention. However, a supervised model may require a significant number of training samples and must handle the distinct text typologies present in Pastebin. This paper presents a classification approach composed of three cascading supervised classifiers that use Active Learning to select and label the most informative samples from Pastebin. The modularity of the proposed design allows each classifier to adapt to a specific text typology. The first classifier determines whether the text is a code snippet, and the second identifies whether it is readable. The third classification level is twofold: (i) a binary classifier that determines whether the text is suspicious and (ii) a multiclass classifier with seven predefined categories of possibly illegal activities. The average class recall of the binary and multiclass classifiers is 95.24% and 80.33%, respectively. Additionally, this paper presents a dataset of 3.8 million Pastebin samples, called onlIne Notepad Services PastEbin aCtiviTies (INSPECT-3.8M), along with the labels assigned by our classification framework. Our classifier recognised that 7.54% of the collected samples are associated with presumably criminal activities. Law enforcement agencies may benefit from the insights shared in our research when aiming to investigate or automate the monitoring of Pastebin or other Online Notepad Services. This would allow responsible authorities to block illegal content before it spreads to the public.
KW - Active learning
KW - Machine learning
KW - Pastebin
KW - Suspicious activities
UR - http://www.scopus.com/inward/record.url?scp=85207265965&partnerID=8YFLogxK
U2 - 10.1007/s10844-024-00902-8
DO - 10.1007/s10844-024-00902-8
M3 - Article
AN - SCOPUS:85207265965
SN - 0925-9902
JO - Journal of Intelligent Information Systems
JF - Journal of Intelligent Information Systems
ER -