Detecting hacker threats: Performance of word and sentence embedding models in identifying hacker communications

Andrei Lima Queiroz, Susan McKeever, Brian Keegan

Research output: Contribution to journal › Conference article › peer-review

7 Citations (Scopus)

Abstract

Cyber security initiatives are finding new approaches to mitigating threats against the computational infrastructure of companies. One of these approaches is the use of text mining techniques and classification models to detect potentially malicious messages or posts in hacker communications. This is a difficult task due to the ambiguity and the heavy use of technical vocabulary inherent in such posts. This paper aims to evaluate the use of robust language models for feature representation of input to downstream classification tasks on hacker communication posts. We perform the experiment against five hacker forum datasets using a variety of language models: two Word Embeddings (Word2vec and GloVe), and three Sentence Embeddings (Sent2vec, InferSent and SentEncoder). We conclude that, for this task, only Sentence Embeddings enhance the performance of SVM classification models compared to traditional language models (Bag-of-words, word/char n-grams). Additionally, we found that models using CNN improve upon SVM models, achieving 93% positive recall and 96% average class accuracy.
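The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes that "positive recall" is recall on the malicious class and that "average class accuracy" is the unweighted mean of per-class recall (often called balanced accuracy); the abstract does not define either term, so this reading is an assumption. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    totals = {}
    hits = {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            hits[truth] = hits.get(truth, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


def positive_recall(y_true, y_pred, positive=1):
    """Fraction of truly positive (malicious) posts that were flagged."""
    return per_class_recall(y_true, y_pred)[positive]


def average_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed definition)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical toy labels: 1 = malicious post, 0 = benign post.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(positive_recall(y_true, y_pred))
print(average_class_accuracy(y_true, y_pred))
```

Averaging recall per class rather than over all posts keeps a majority of benign posts from masking poor detection of the malicious class, which is presumably why both figures are reported together.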

Original language: English
Pages (from-to): 116-127
Number of pages: 12
Journal: CEUR Workshop Proceedings
Volume: 2563
Publication status: Published - 2019
Event: 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2019 - Galway, Ireland
Duration: 5 Dec 2019 - 6 Dec 2019

Keywords

  • CNN
  • Cybersecurity
  • SVM
  • Sentence Embeddings
  • Text Mining
  • Threat intelligence
  • Word Embeddings
