Textual case-based reasoning for spam filtering: A comparison of feature-based and feature-free approaches

Sarah Jane Delany, Derek Bridge

    Research output: Contribution to journal › Article › peer-review

    9 Citations (Scopus)

    Abstract

    Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach.
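
    The abstract does not give the formula for the compression-based distance, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the normalized compression distance (NCD), uses Python's zlib module (DEFLATE, the same Lempel-Ziv-based algorithm used by GZip) as the compressor, and classifies with a plain 1-nearest-neighbour rule. The names compressed_size, ncd and classify, and the sample emails, are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the DEFLATE-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (an assumed choice of measure).

    Small when x and y share a lot of content, because the compressor
    can reuse material from x when encoding y; close to 1 when they
    have little in common.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]]) -> str:
    """Return the label of the case nearest to query (1-NN, assumed)."""
    _, label = min(case_base, key=lambda case: ncd(query, case[0]))
    return label


# Hypothetical two-case case base of (email text, label) pairs.
spam_text = (
    "Congratulations! You have won a free prize. Click here now "
    "to claim your cheap pills and limited time offer."
)
ham_text = (
    "Hi all, the meeting about the journal draft is moved to Thursday "
    "at ten; please bring the revised results tables."
)
case_base = [(spam_text, "spam"), (ham_text, "ham")]
```

    No features are extracted and nothing is trained: adding a case only means appending its raw text to the case base, which is one way to read the "no set-up costs" claim. Each classification, however, compresses the query together with every stored case, which is consistent with the longer classification times the abstract reports and with the motivation for case base editing.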

    Original language: English
    Pages (from-to): 75-87
    Number of pages: 13
    Journal: Artificial Intelligence Review
    Volume: 26
    Issue number: 1-2
    Publication status: Published - Oct 2006

    Keywords

    • Case-base editing
    • Case-based maintenance
    • Case-based reasoning
    • Distance measures
    • Feature selection
    • Spam filtering
    • Text compression
