Defining and evaluating blog characteristics

Fernando Perez-Tellez, David Pinto, John Cardiff, Paolo Rosso

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    Abstract

    The analysis of weblogs has become a popular area of natural language processing. Because of their specific characteristics, such as shortness and the size and nature of their vocabulary, it can be difficult to achieve good results using automated clustering techniques. In particular, blogs can vary considerably, both in length and in breadth of topic. Without a priori knowledge of the nature of a blog, it is difficult to achieve accurate clustering results. In this paper, we present a framework for the assessment of a set of corpus features that provide insight into the nature of a blog corpus from a number of perspectives, including shortness, broadness and class imbalance. This in turn allows us to assess the relative hardness of the clustering task and to identify components that can improve clustering accuracy. We furthermore present the results of experiments in which we analyzed the features of two sample blog corpora and compared the results with other kinds of short texts.

    Original language: English
    Title of host publication: 8th Mexican International Conference on Artificial Intelligence - Proceedings of the Special Session, MICAI 2009
    Pages: 97-102
    Number of pages: 6
    Publication status: Published - 2009
    Event: 8th Mexican International Conference on Artificial Intelligence, MICAI 2009 - Guanajuato, Guanajuato, Mexico
    Duration: 9 Nov 2009 to 13 Nov 2009

    Publication series

    Name: 8th Mexican International Conference on Artificial Intelligence - Proceedings of the Special Session, MICAI 2009

    Conference

    Conference: 8th Mexican International Conference on Artificial Intelligence, MICAI 2009
    Country/Territory: Mexico
    City: Guanajuato, Guanajuato
    Period: 9/11/09 to 13/11/09

    Keywords

    • Blogs
    • Characterization
    • Short text
