Improving file-level fuzzy hashes for malware variant classification

Research output: Contribution to journal › Article › peer-review

Abstract

Malware analysts need to predict family membership accurately and swiftly, as well as determine whether a suspect file contains malicious content. Previous research has shown that fuzzy hashing can be used to determine whether a file is malicious and to cluster like files together, but it does not specifically address the problem of malware variant classification. Existing tools such as VirusTotal maintain file-level and section-level cryptographic hashes and ssdeep file digests, but they do not maintain section-level similarity hashes or provide a means to submit similarity hashes and compare them to previously analyzed samples. This paper presents a study on the feasibility of using section-level similarity hashing as a means of classifying malware variants. The aim of the study was to produce a method that overcomes the limitations of file-level similarity hashing, such as poor performance against obfuscated malware. Section-level similarity hashing involves splitting malware executables into their binary headers and sections and applying a similarity digest to each resulting binary chunk. Experiments with known malware families were conducted using file-level and section-level digests, where each method was used to predict malware family membership. The performance of both methods was evaluated using precision, recall and accuracy metrics. The results show that similarity digests can be used to classify malware in Windows Portable Executable (PE) files and that section-level hashing and comparison produces considerably better results than file-level hashing.
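
The idea of comparing per-section digests rather than a single file digest can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the executable has already been split into named chunks (the `Sections` mapping is hypothetical), it uses a Jaccard score over byte n-grams as a simple stand-in for a real similarity digest such as ssdeep, and the "best match over all section pairs" rule and the toy data are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: chunk name -> raw bytes of that header/section.
Sections = Dict[str, bytes]


def ngram_set(data: bytes, n: int = 4) -> set:
    """Set of byte n-grams; a crude stand-in for a similarity digest."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two n-gram sets, in [0, 1]."""
    sa, sb = ngram_set(a), ngram_set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def section_score(sample: Sections, known: Sections) -> float:
    """Section-level score: best similarity over all chunk pairs (assumed rule)."""
    return max(
        (similarity(x, y) for x in sample.values() for y in known.values()),
        default=0.0,
    )


def file_score(sample: Sections, known: Sections) -> float:
    """File-level score: similarity of the concatenated chunks."""
    return similarity(b"".join(sample.values()), b"".join(known.values()))


def classify(
    sample: Sections,
    corpus: List[Tuple[str, Sections]],
    score: Callable[[Sections, Sections], float],
) -> str:
    """Predict the family label of the most similar known sample."""
    return max(corpus, key=lambda item: score(sample, item[1]))[0]


# Toy data (made up): the sample shares its .text chunk with family "A"
# but carries an unrelated, larger .rsrc chunk.
text_a = bytes(range(256))
text_b = bytes(range(255, -1, -1))
corpus = [
    ("A", {".text": text_a, ".data": b"\x00" * 64}),
    ("B", {".text": text_b, ".data": b"\xff" * 64}),
]
sample = {".text": text_a, ".rsrc": b"\x11" * 512}

if __name__ == "__main__":
    print("section-level:", classify(sample, corpus, section_score))
    print("file-level score vs A:", file_score(sample, corpus[0][1]))
```

In this toy case the shared `.text` chunk matches exactly at section level, while the file-level score against the same family is lowered by the unrelated chunk. With a real digest, `similarity` would be replaced by that digest's own comparison function.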

Original language: English
Pages (from-to): S88-S94
Journal: Digital Investigation
Volume: 28
Publication status: Published - Apr 2019

Keywords

  • Classification
  • File-level
  • Fuzzy hash
  • Malware
  • PE file
  • Section-level
  • Similarity digest
