TY - GEN
T1 - Moving Targets
T2 - 2020 International Conference on Cyber Security and Protection of Digital Services, Cyber Security 2020
AU - Queiroz, Andrei Lima
AU - Keegan, Brian
AU - McKeever, Susan
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
N2 - In this paper, we investigate the presence of concept drift in machine learning models for detecting hacker communications posted on social media and hacker forums. The supervised models in this experiment are analysed in terms of their performance over time across different data sources (surface web and deep web). Additionally, to simulate real-world situations, these models are evaluated using time-stamped messages from our datasets, posted over time on social media platforms. We found that models applied to hacker forums (deep web) show a deterioration in accuracy in less than one year, whereas models applied to Twitter (surface web) show no decrease in accuracy over the same period. The problem is alleviated by retraining the model with new instances (and applying weights) to reduce the effects of concept drift. While our results indicate that performance degradation due to concept drift is avoided with 50% relabelling, which is challenging in real-world scenarios, our work paves the way for more targeted concept drift solutions that reduce the retraining effort.
KW - Concept Drift
KW - Cyber Security
KW - Hacker Communication
KW - Machine Learning
KW - Software Vulnerabilities
UR - http://www.scopus.com/inward/record.url?scp=85091978046&partnerID=8YFLogxK
U2 - 10.1109/CyberSecurity49315.2020.9138894
DO - 10.1109/CyberSecurity49315.2020.9138894
M3 - Conference contribution
AN - SCOPUS:85091978046
T3 - International Conference on Cyber Security and Protection of Digital Services, Cyber Security 2020
BT - International Conference on Cyber Security and Protection of Digital Services, Cyber Security 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 June 2020 through 19 June 2020
ER -