TY - GEN
T1 - Defining and evaluating blog characteristics
AU - Perez-Tellez, Fernando
AU - Pinto, David
AU - Cardiff, John
AU - Rosso, Paolo
PY - 2009
Y1 - 2009
N2 - The analysis of weblogs has become a popular area of natural language processing. Due to their specific characteristics, such as shortness, vocabulary size, and nature, it can be difficult to achieve good results using automated clustering techniques. In particular, their nature can vary considerably, both in length and in breadth of topic. Without a priori knowledge of the nature of a blog, it is difficult to achieve accurate clustering results. In this paper, we present a framework for the assessment of a set of corpus features that will provide us with insight into the nature of blogs from a number of perspectives, including shortness, broadness, and class imbalance. This in turn allows us to assess the relative hardness of the clustering task and to identify components that can improve its accuracy. We furthermore present the results of experiments in which we analyzed the features of two sample blog corpora and compared the results with other kinds of short texts.
AB - The analysis of weblogs has become a popular area of natural language processing. Due to their specific characteristics, such as shortness, vocabulary size, and nature, it can be difficult to achieve good results using automated clustering techniques. In particular, their nature can vary considerably, both in length and in breadth of topic. Without a priori knowledge of the nature of a blog, it is difficult to achieve accurate clustering results. In this paper, we present a framework for the assessment of a set of corpus features that will provide us with insight into the nature of blogs from a number of perspectives, including shortness, broadness, and class imbalance. This in turn allows us to assess the relative hardness of the clustering task and to identify components that can improve its accuracy. We furthermore present the results of experiments in which we analyzed the features of two sample blog corpora and compared the results with other kinds of short texts.
KW - Blogs
KW - Characterization
KW - Short text
UR - http://www.scopus.com/inward/record.url?scp=77950999568&partnerID=8YFLogxK
U2 - 10.1109/MICAI.2009.21
DO - 10.1109/MICAI.2009.21
M3 - Conference contribution
AN - SCOPUS:77950999568
SN - 9780769539331
T3 - 8th Mexican International Conference on Artificial Intelligence - Proceedings of the Special Session, MICAI 2009
SP - 97
EP - 102
BT - 8th Mexican International Conference on Artificial Intelligence - Proceedings of the Special Session, MICAI 2009
T2 - 8th Mexican International Conference on Artificial Intelligence, MICAI 2009
Y2 - 9 November 2009 through 13 November 2009
ER -