Abstract
Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. We then describe an alternative, feature-free way to compute the distances between cases, using a distance measure based on text compression. This distance measure has two advantages: it has no set-up costs and it is resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust across compression algorithms: accuracy with a Lempel-Ziv compressor (GZip) is approximately the same as with a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system.
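To make the feature-free idea concrete, here is a minimal sketch of a compression-based distance combined with a nearest-neighbour vote over a case base. It is an illustration under stated assumptions, not the system evaluated in the paper. The abstract does not give the distance formula, so the sketch substitutes the well-known normalized compression distance (NCD). It uses Python's `zlib` (DEFLATE, the algorithm behind GZip) as the Lempel-Ziv compressor. The standard library has no PPM compressor, so the statistical variant is not shown. The names `compressed_size`, `ncd`, and `classify`, and the default `k=3`, are hypothetical.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Assumption: NCD is used here as a stand-in; the abstract does not
    state which compression-based measure the authors use.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the two texts compressed together
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """Label a query email by majority vote among its k nearest cases.

    case_base holds (email_text, label) pairs. No feature extraction or
    selection is performed; the raw text is the case representation.
    """
    nearest = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The sketch reflects both properties mentioned in the abstract. There is no set-up step, because no features are extracted or selected before classification. Classification is slow, because each query is compressed together with every case in the case base.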
Original language | English |
---|---|
Publication status | Published - 2006 |
Event | 17th. Irish Conference on Artificial Intelligence and Cognitive Science - Duration: 1 Jan 2006 → … |
Keywords
- Spam filtering
- text classification
- Case-Based Reasoning
- ECUE system
- feature-based
- feature-free
- text compression
- distance measure
- concept drift
- empirical comparison
- compression algorithms
- Lempel-Ziv compressor
- GZip
- statistical compressor
- PPM