Weblog and short text feature extraction and impact on categorisation

Fernando Perez-Tellez, John Cardiff, Paolo Rosso, David Pinto

Research output: Contribution to journal › Article › peer-review

Abstract

The characterisation and categorisation of weblogs and other short texts has become an important research theme in areas such as topic/trend detection and pattern recognition. The value of analysing and characterising short texts lies in identifying the features that distinguish them, thereby improving the input to the classification process. In this work, we analyse a large number of text features and establish which combinations are useful for discriminating between the different genres of short text. Having identified the most promising features, we then confirm our findings by performing the categorisation task using three approaches: the Gaussian and SVM classifiers and the K-means clustering algorithm. Several hundred combinations of features were analysed in order to identify the best combinations, and the results confirmed the observations made. The novel aspect of our work is the detection of the best combination of individual metrics, which are identified as potential features for the categorisation process.
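The search over feature combinations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three surface features (`n_tokens`, `avg_word_len`, `type_token_ratio`) are hypothetical stand-ins for the paper's metrics, a hand-written Gaussian naive Bayes model stands in for the Gaussian classifier (the SVM and K-means steps are omitted), and leave-one-out accuracy is an assumed scoring scheme.

```python
import math
from itertools import combinations


def extract_features(text):
    """Compute a few hypothetical surface features for one short text."""
    tokens = text.lower().split()
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "n_tokens": float(len(tokens)),
        "avg_word_len": sum(len(t) for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
    }


def gaussian_nb_predict(train, labels, x):
    """Predict the label of x with a Gaussian naive Bayes model fit on train."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(labels)):
        rows = [r for r, l in zip(train, labels) if l == label]
        score = math.log(len(rows) / len(train))  # log class prior
        for j, value in enumerate(x):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of the classifier on the given feature rows."""
    correct = 0
    for i in range(len(rows)):
        train = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if gaussian_nb_predict(train, train_labels, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


def rank_feature_combinations(texts, labels):
    """Score every non-empty feature subset; best (and smallest) first."""
    feats = [extract_features(t) for t in texts]
    names = sorted(feats[0])
    results = []
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            rows = [[f[name] for name in combo] for f in feats]
            results.append((loo_accuracy(rows, labels), combo))
    results.sort(key=lambda r: (-r[0], len(r[1]), r[1]))
    return results
```

Calling `rank_feature_combinations(texts, labels)` on a labelled collection returns `(accuracy, feature_names)` pairs, so the first entry is the best-scoring combination under these assumptions. With n features the exhaustive search evaluates 2^n - 1 subsets, which is only practical for small feature sets.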

Original language: English
Pages (from-to): 2529-2544
Number of pages: 16
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 27
Issue number: 5
Publication status: Published - 2014

Keywords

  • Short text characterisation
  • feature extraction
