Feature based and feature free textual CBR: a comparison in spam filtering

Sarah Jane Delany, Derek Bridge

Research output: Contribution to conference › Paper › peer-review

Abstract

Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. We then describe an alternative, feature-free way to compute the distances between cases, using a distance measure based on text compression. This distance measure has two advantages: it has no set-up costs and it is resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust across compression algorithms: accuracy with a Lempel-Ziv compressor (GZip) is approximately the same as with a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system.
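
The feature-free idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the distance is taken to be the ratio C(xy) / (C(x) + C(y)), where C is the compressed size in bytes, which is one common compression-based dissimilarity; Python's zlib (DEFLATE, the same Lempel-Ziv family as GZip) stands in for the compressors named above; and classification is a plain k-nearest-neighbour majority vote. The names compressed_size, compression_distance and classify are hypothetical and do not come from the ECUE system.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_distance(x: str, y: str) -> float:
    """Assumed dissimilarity: C(xy) / (C(x) + C(y)).

    Texts that share many substrings compress well together, so the
    ratio is smaller for similar texts than for unrelated ones.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def classify(query: str, cases: list[tuple[str, str]], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest cases.

    cases is a list of (text, label) pairs; no features are extracted,
    the raw text of each case is compared directly.
    """
    ranked = sorted(cases, key=lambda case: compression_distance(query, case[0]))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)


if __name__ == "__main__":
    # Toy case base, invented for illustration only.
    case_base = [
        ("Buy cheap pills now! Limited offer, click here to win a free prize today.", "spam"),
        ("Hi Derek, the meeting about the conference paper is moved to Thursday at ten.", "ham"),
    ]
    query = "Click here to win a free prize, limited offer on cheap pills"
    print(classify(query, case_base, k=1))
```

Every classification compresses the query against each stored case, which fits the abstract's remark that the feature-free systems take much longer to classify emails than the feature-based one.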
Original language: English
Publication status: Published - 2006
Event: 17th Irish Conference on Artificial Intelligence and Cognitive Science
Duration: 1 Jan 2006 → …

Conference

Conference: 17th Irish Conference on Artificial Intelligence and Cognitive Science
Period: 1/01/06 → …

Keywords

  • Spam filtering
  • text classification
  • Case-Based Reasoning
  • ECUE system
  • feature-based
  • feature-free
  • text compression
  • distance measure
  • concept drift
  • empirical comparison
  • compression algorithms
  • Lempel-Ziv compressor
  • GZip
  • statistical compressor
  • PPM
