Abstract
In this paper, we investigate the document categorization task with statistical language models. Our study focuses mainly on the categorization of documents in agglutinative languages. Because the morphology of agglutinative languages is highly productive, the number of word forms encountered in naturally occurring text is very large. From the language modeling perspective, a large vocabulary results in serious data sparseness problems. To cope with this drawback, previous studies in various application areas have proposed modified language models based on different morphological units, and they report that these modified models improve performance. In our document categorization experiments, we use standard word-form-based language models as well as modified language models based on root words, root words combined with part-of-speech information, truncated word forms, and character sequences. Additionally, to find an optimal parameter set, we carry out multiple tests with different language model orders and smoothing methods. Consistent with previous studies on other tasks, our experimental results on the categorization of Turkish documents show that applying linguistic preprocessing steps before language modeling provides some improvement over standard language models. However, we also observe that a similar level of improvement can be achieved with simpler character-level or truncated-word-form models, which are language independent.
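As a rough illustration of the language-independent variants mentioned in the abstract, the following minimal sketch trains one n-gram language model per category and assigns a document to the category whose model gives it the highest log-likelihood. It is not the authors' implementation. The function and class names (`truncate_units`, `char_units`, `NGramLM`, `train_models`, `classify`), the prefix length `k`, the add-one smoothing, and the toy Turkish sentences are all assumptions made for this example; the paper compares several smoothing methods and model orders that are not reproduced here.

```python
import math
from collections import Counter


def truncate_units(text, k=5):
    """Truncated word forms: keep the first k characters of each token.
    The default k=5 is an arbitrary, illustrative choice."""
    return [w[:k] for w in text.lower().split()]


def char_units(text):
    """Character sequence: every character of the text, including spaces."""
    return list(text.lower())


class NGramLM:
    """Order-n language model with add-one smoothing (an assumed,
    simple smoothing choice used only for this sketch)."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = Counter()    # counts of (context + unit)
        self.contexts = Counter()  # counts of context alone
        self.vocab = set()

    def _pad(self, units):
        return ["<s>"] * (self.n - 1) + list(units) + ["</s>"]

    def train(self, units):
        seq = self._pad(units)
        self.vocab.update(seq)
        for i in range(self.n - 1, len(seq)):
            ctx = tuple(seq[i - self.n + 1:i])
            self.ngrams[ctx + (seq[i],)] += 1
            self.contexts[ctx] += 1

    def log_prob(self, units):
        seq = self._pad(units)
        v = len(self.vocab) + 1  # one extra slot for unseen units
        total = 0.0
        for i in range(self.n - 1, len(seq)):
            ctx = tuple(seq[i - self.n + 1:i])
            num = self.ngrams[ctx + (seq[i],)] + 1
            den = self.contexts[ctx] + v
            total += math.log(num / den)
        return total


def train_models(labeled_docs, unitize, n=2):
    """Build one language model per category label."""
    models = {}
    for label, text in labeled_docs:
        models.setdefault(label, NGramLM(n)).train(unitize(text))
    return models


def classify(models, text, unitize):
    """Return the label whose model scores the text highest."""
    units = unitize(text)
    return max(models, key=lambda label: models[label].log_prob(units))


# Toy, made-up training data purely for illustration.
toy_docs = [
    ("sports", "takım maçı kazandı"),
    ("sports", "futbolcular maçta gol attı"),
    ("economy", "merkez bankası faiz oranını artırdı"),
    ("economy", "borsa ve döviz kurları yükseldi"),
]


def prefix3(text):
    return truncate_units(text, k=3)


models = train_models(toy_docs, prefix3, n=1)
print(classify(models, "maç gol", prefix3))
```

Swapping `prefix3` for `char_units` and raising `n` gives a character-level model using the same training and scoring code, which is the sense in which these unit choices need no language-specific preprocessing.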
Original language | English |
---|---|
Pages (from-to) | 632-645 |
Number of pages | 14 |
Journal | International Journal of Computational Intelligence Systems |
Volume | 3 |
Issue number | 5 |
Publication status | Published - Oct 2010 |
Bibliographical note
Publisher Copyright: © 2010, the authors.
Keywords
- document categorization
- n-gram
- statistical language modeling
- Turkish