TY - JOUR
T1 - Malware family classification via efficient Huffman features
AU - O'Shaughnessy, Stephen
AU - Breitinger, Frank
N1 - Publisher Copyright: © 2021 The Authors
PY - 2021/7
Y1 - 2021/7
N2 - As malware evolves and becomes more complex, researchers strive to develop detection and classification schemes that abstract away from the internal intricacies of binary code to represent malware without the need for architectural knowledge or invasive analysis procedures. Such approaches can reduce the complexities of feature generation and simplify the analysis process. In this paper, we present efficient Huffman features (eHf), a novel compression-based approach to feature construction, based on Huffman encoding, where malware features are represented in a compact format, without the need for intrusive reverse-engineering or dynamic analysis processes. We demonstrate the viability of eHf as a solution for classifying malware into their respective families on a large malware corpus of 15k samples, indicative of the current threat landscape. We evaluate eHf against current compression-based alternatives and show that our method is comparable or superior in classification accuracy, while exhibiting considerably greater runtime efficiency. Finally, we demonstrate that eHf is resilient against code reordering obfuscation.
AB - As malware evolves and becomes more complex, researchers strive to develop detection and classification schemes that abstract away from the internal intricacies of binary code to represent malware without the need for architectural knowledge or invasive analysis procedures. Such approaches can reduce the complexities of feature generation and simplify the analysis process. In this paper, we present efficient Huffman features (eHf), a novel compression-based approach to feature construction, based on Huffman encoding, where malware features are represented in a compact format, without the need for intrusive reverse-engineering or dynamic analysis processes. We demonstrate the viability of eHf as a solution for classifying malware into their respective families on a large malware corpus of 15k samples, indicative of the current threat landscape. We evaluate eHf against current compression-based alternatives and show that our method is comparable or superior in classification accuracy, while exhibiting considerably greater runtime efficiency. Finally, we demonstrate that eHf is resilient against code reordering obfuscation.
KW - Compression
KW - Feature construction
KW - Huffman encoding
KW - Machine learning
KW - Malware abstraction
KW - Malware classification
UR - http://www.scopus.com/inward/record.url?scp=85112563130&partnerID=8YFLogxK
U2 - 10.1016/j.fsidi.2021.301192
DO - 10.1016/j.fsidi.2021.301192
M3 - Article
AN - SCOPUS:85112563130
SN - 2666-2825
VL - 37
JO - Forensic Science International: Digital Investigation
JF - Forensic Science International: Digital Investigation
M1 - 301192
ER -