TY - JOUR
T1 - Document Categorization with Modified Statistical Language Models for Agglutinative Languages
AU - Tantuğ, Ahmet Cüneyd
N1 - Publisher Copyright: © 2010, the authors.
PY - 2010/10
Y1 - 2010/10
AB - In this paper, we investigate the document categorization task with statistical language models. Our study mainly focuses on the categorization of documents in agglutinative languages. Due to the productive morphology of agglutinative languages, the number of word forms encountered in naturally occurring text is very large. From the language modeling perspective, a large vocabulary results in serious data sparseness problems. To cope with this drawback, previous studies in various application areas suggest modified language models based on different morphological units, and report that these modified language models can improve performance. In our document categorization experiments, we use standard word-form-based language models as well as modified language models based on root words, root words with part-of-speech information, truncated word forms, and character sequences. Additionally, to find an optimal parameter set, multiple tests are carried out with different language model orders and smoothing methods. Similar to previous studies on other tasks, our experimental results on the categorization of Turkish documents reveal that applying linguistic preprocessing steps for language modeling provides some improvement over standard language models. However, we also observe that a similar level of improvement can be achieved by simpler character-level or truncated-word-form models, which are language independent.
KW - document categorization
KW - n-gram
KW - statistical language modeling
KW - Turkish
UR - http://www.scopus.com/inward/record.url?scp=85180546813&partnerID=8YFLogxK
U2 - 10.2991/ijcis.2010.3.5.12
DO - 10.2991/ijcis.2010.3.5.12
M3 - Article
AN - SCOPUS:85180546813
SN - 1875-6891
VL - 3
SP - 632
EP - 645
JO - International Journal of Computational Intelligence Systems
JF - International Journal of Computational Intelligence Systems
IS - 5
ER -