TY - JOUR
T1 - Detecting hacker threats
T2 - 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2019
AU - Queiroz, Andrei Lima
AU - McKeever, Susan
AU - Keegan, Brian
N1 - Publisher Copyright: Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
PY - 2019
Y1 - 2019
N2 - Cyber security initiatives are finding new approaches to mitigating threats against the computational infrastructure of companies. One of these approaches is the use of text mining techniques and classification models to detect potentially malicious messages or posts in hacker communications. This is a difficult task due to the ambiguity and the heavy use of technical vocabulary inherent in such posts. This paper aims to evaluate the use of robust language models for feature representation of input to downstream classification tasks of hacker communication posts. We perform the experiment against five hacker forum datasets using a variety of language models: two Word Embeddings (Word2vec and Glove), and three Sentence Embeddings (Sent2vec, InferSent and SentEncoder). We conclude that, for this task, only Sentence Embeddings enhance the performance of SVM classification models compared to traditional language models (Bag-of-words, word/char n-grams). Additionally, we found that models using CNN improve upon SVM models, achieving 93% positive recall and 96% average class accuracy.
AB - Cyber security initiatives are finding new approaches to mitigating threats against the computational infrastructure of companies. One of these approaches is the use of text mining techniques and classification models to detect potentially malicious messages or posts in hacker communications. This is a difficult task due to the ambiguity and the heavy use of technical vocabulary inherent in such posts. This paper aims to evaluate the use of robust language models for feature representation of input to downstream classification tasks of hacker communication posts. We perform the experiment against five hacker forum datasets using a variety of language models: two Word Embeddings (Word2vec and Glove), and three Sentence Embeddings (Sent2vec, InferSent and SentEncoder). We conclude that, for this task, only Sentence Embeddings enhance the performance of SVM classification models compared to traditional language models (Bag-of-words, word/char n-grams). Additionally, we found that models using CNN improve upon SVM models, achieving 93% positive recall and 96% average class accuracy.
KW - CNN
KW - Cybersecurity
KW - SVM
KW - Sentence Embeddings
KW - Text Mining
KW - Threat intelligence
KW - Word Embeddings
UR - http://www.scopus.com/inward/record.url?scp=85081635826&partnerID=8YFLogxK
UR - https://arrow.tudublin.ie/scschcomcon/280/
M3 - Conference article
AN - SCOPUS:85081635826
SN - 1613-0073
VL - 2563
SP - 116
EP - 127
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
Y2 - 5 December 2019 through 6 December 2019
ER -