Abstract
Cyber security initiatives are finding new approaches to mitigating threats against the computational infrastructure of companies. One of these approaches is the use of text mining techniques and classification models to detect potentially malicious messages or posts in hacker communications. This is a difficult task due to the ambiguity and the heavy use of technical vocabulary inherent in such posts. This paper aims to evaluate the use of robust language models for feature representation of input to downstream classification tasks of hacker communication posts. We perform the experiment against five hacker forum datasets using a variety of language models: two Word Embeddings (Word2vec and GloVe), and three Sentence Embeddings (Sent2vec, InferSent and SentEncoder). We conclude that, for this task, only Sentence Embeddings enhance the performance of SVM classification models compared to traditional language models (Bag-of-words, word/char n-grams). Additionally, we found that models using CNN improve upon SVM models, achieving 93% positive recall and 96% average class accuracy.
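The two metrics reported above can be illustrated with a minimal sketch. It assumes that "average class accuracy" means the unweighted mean of per-class recall (often called balanced accuracy); the abstract does not define the term, so this reading is an assumption. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def average_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed definition, see lead-in)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Made-up labels for illustration only: 1 = malicious post, 0 = benign post.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

positive_recall = per_class_recall(y_true, y_pred)[1]
avg_class_acc = average_class_accuracy(y_true, y_pred)
print(f"positive recall: {positive_recall:.3f}")
print(f"average class accuracy: {avg_class_acc:.3f}")
```

Averaging recall per class, rather than computing plain accuracy, keeps a large benign class from masking poor detection of the rarer malicious posts, which is presumably why positive recall is reported separately.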
| Original language | English |
|---|---|
| Pages (from-to) | 116-127 |
| Number of pages | 12 |
| Journal | CEUR Workshop Proceedings |
| Volume | 2563 |
| Publication status | Published - 2019 |
| Event | 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2019, Galway, Ireland. Duration: 5 Dec 2019 → 6 Dec 2019 |
Keywords
- CNN
- Cybersecurity
- SVM
- Sentence Embeddings
- Text Mining
- Threat intelligence
- Word Embeddings
Title
Detecting hacker threats: Performance of word and sentence embedding models in identifying hacker communications