TY - GEN
T1 - A methodology to cluster informal language register data
AU - Perez-Tellez, Fernando
AU - Pinto, David
AU - Cardiff, John
AU - Rosso, Paolo
PY - 2009
Y1 - 2009
N2 - Analyzing and classifying web content is a task that has been attracting an increasing amount of interest in recent years. However, there are additional challenges to face with user-generated content emanating from Web 2.0 applications such as blogs, commentaries, reviews, etc. The typical characteristics of this information include features such as shortness, overlapping vocabulary, and vocabulary size and nature that make it difficult to achieve good results using automated clustering processes. The Web 2.0 informal written register introduces further challenges, containing incomplete sentences, misspellings, spontaneous structures, etc. These characteristics make it difficult to select or employ external resources to improve the clustering performance. In this work we apply a methodology that does not rely on any external resources in order to automatically cluster this data. This approach improves the representation of informal language register data by using a term enriching procedure and also uses a term selection technique to identify the most important and discriminative information. Our results show that this technique can produce significant improvements in the quality of clusters produced.
AB - Analyzing and classifying web content is a task that has been attracting an increasing amount of interest in recent years. However, there are additional challenges to face with user-generated content emanating from Web 2.0 applications such as blogs, commentaries, reviews, etc. The typical characteristics of this information include features such as shortness, overlapping vocabulary, and vocabulary size and nature that make it difficult to achieve good results using automated clustering processes. The Web 2.0 informal written register introduces further challenges, containing incomplete sentences, misspellings, spontaneous structures, etc. These characteristics make it difficult to select or employ external resources to improve the clustering performance. In this work we apply a methodology that does not rely on any external resources in order to automatically cluster this data. This approach improves the representation of informal language register data by using a term enriching procedure and also uses a term selection technique to identify the most important and discriminative information. Our results show that this technique can produce significant improvements in the quality of clusters produced.
KW - Blogs
KW - Clustering
KW - Web 2.0
UR - https://www.scopus.com/pages/publications/84872148131
M3 - Conference contribution
AN - SCOPUS:84872148131
SN - 9780972741279
T3 - Proceedings of the 4th Indian International Conference on Artificial Intelligence, IICAI 2009
SP - 1391
EP - 1401
BT - Proceedings of the 4th Indian International Conference on Artificial Intelligence, IICAI 2009
T2 - 4th Indian International Conference on Artificial Intelligence, IICAI 2009
Y2 - 16 December 2009 through 18 December 2009
ER -