Empirical comparative analysis of 1-of-K coding and K-prototypes in categorical clustering

Fei Wang, Hector Franco, John Pugh, Robert Ross

Research output: Contribution to journal › Conference article › peer-review

2 Citations (Scopus)

Abstract

Clustering is a fundamental machine learning application that partitions data into homogeneous groups. K-means and its variants are the most widely used class of clustering algorithms today. However, the original k-means algorithm can only be applied to numeric data. Categorical data must first be converted into numeric data through 1-of-K coding, which itself causes many problems. K-prototypes, another clustering algorithm that originates from the k-means algorithm, can handle categorical data directly by adopting a different notion of distance. In this paper, we systematically compare these two methods through an experimental analysis. Our analysis shows that K-prototypes is better suited to large-scale datasets, while the performance of k-means with 1-of-K coding is more stable. We believe these are useful heuristics for clustering methods working with highly categorical data.
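As an illustration only (not taken from the paper), the two notions of distance mentioned in the abstract can be sketched on a toy example. The sketch assumes that the "different notion of distance" for categorical attributes is the simple matching dissimilarity commonly used in K-prototypes, which counts mismatched attributes. The records and all function names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_vocab(records: Sequence[Sequence[str]]) -> List[Tuple[int, str]]:
    """Collect every (attribute index, category value) pair, sorted."""
    return sorted({(j, v) for row in records for j, v in enumerate(row)})


def one_of_k(row: Sequence[str], vocab: List[Tuple[int, str]]) -> List[int]:
    """1-of-K coding: one binary column per (attribute, value) pair."""
    return [1 if row[j] == v else 0 for j, v in vocab]


def squared_euclidean(a: Sequence[int], b: Sequence[int]) -> int:
    """Distance k-means would use on the 1-of-K encoded vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simple_matching(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of attributes on which two categorical records differ
    (assumed here to be the categorical dissimilarity in K-prototypes)."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical toy data: three records with three categorical attributes.
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]

vocab = build_vocab(records)
encoded = [one_of_k(r, vocab) for r in records]

# Three attributes with two values each expand into six binary columns.
print(len(vocab))
print(squared_euclidean(encoded[0], encoded[1]), simple_matching(records[0], records[1]))
```

In this sketch every mismatched attribute contributes two differing binary columns, so the squared Euclidean distance on the encoded vectors is twice the simple matching count. The encoded representation also grows with the number of distinct category values, which may be one reason the two approaches scale differently.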

Original language: English
Pages (from-to): 248-259
Number of pages: 12
Journal: CEUR Workshop Proceedings
Volume: 1751
Publication status: Published - 2016
Event: 24th Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2016 - Dublin, Ireland
Duration: 20 Sep 2016 to 21 Sep 2016

Keywords

  • Categorical data
  • Clustering
  • Clustering validity
  • Efficiency
  • K-means
  • K-prototypes
