Document categorization with modified statistical language models for agglutinative languages

Ahmet Cüneyd Tantug*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)


In this paper, we investigate the document categorization task with statistical language models. Our study focuses mainly on the categorization of documents in agglutinative languages. Due to the productive morphology of agglutinative languages, the number of word forms encountered in naturally occurring text is very large. From the language modeling perspective, a large vocabulary results in serious data sparseness problems. To cope with this drawback, previous studies in various application areas suggest modified language models based on different morphological units, and they report that performance improvements can be achieved with these models. In our document categorization experiments, we use standard word-form-based language models as well as modified language models based on root words, root words and part-of-speech information, truncated word forms, and character sequences. Additionally, to find an optimal parameter set, multiple tests are carried out with different language model orders and smoothing methods. Similar to previous studies on other tasks, our experimental results on the categorization of Turkish documents reveal that applying linguistic preprocessing steps for language modeling provides improvements over standard language models to some extent. However, we also observe that a similar level of performance improvement can be achieved by simpler character-level or truncated-word-form models, which are language independent.
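As a rough illustration of the language-independent variants mentioned in the abstract, the following sketch trains one character n-gram model per category and assigns a document to the category whose model gives it the highest log-probability. It is not the paper's implementation: the names `char_ngrams`, `truncate_words`, and `CharNgramCategorizer`, the add-one smoothing, the default order n=3, and the truncation length are all assumptions chosen for brevity, whereas the paper reports comparing several orders and smoothing methods.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of `text`, padded so word edges are seen."""
    # Note: str.lower() is not Turkish-locale aware (e.g. "I" becomes "i").
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def truncate_words(text, k):
    """Keep only the first k characters of each whitespace-separated token."""
    return " ".join(tok[:k] for tok in text.split())


class CharNgramCategorizer:
    """One add-one-smoothed character n-gram model per category (hypothetical)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = {}    # category -> Counter of n-grams
        self.context_counts = {}  # category -> Counter of (n-1)-character contexts
        self.alphabet = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            ngrams = self.ngram_counts.setdefault(label, Counter())
            contexts = self.context_counts.setdefault(label, Counter())
            for gram in char_ngrams(doc, self.n):
                ngrams[gram] += 1
                contexts[gram[:-1]] += 1
                self.alphabet.update(gram)
        return self

    def log_prob(self, doc, label):
        # Assumed smoothing: add-one over the observed alphabet, with one
        # extra slot reserved for characters never seen in training.
        v = len(self.alphabet) + 1
        ngrams = self.ngram_counts[label]
        contexts = self.context_counts[label]
        return sum(
            math.log((ngrams[g] + 1) / (contexts[g[:-1]] + v))
            for g in char_ngrams(doc, self.n)
        )

    def predict(self, doc):
        # Pick the category whose model scores the document highest.
        return max(self.ngram_counts, key=lambda c: self.log_prob(doc, c))


# Made-up toy data, for illustration only.
model = CharNgramCategorizer(n=3).fit(
    ["futbol maçı gol takım", "borsa faiz dolar ekonomi"],
    ["spor", "ekonomi"],
)
predicted = model.predict("gol takım")
```

A truncated-word-form variant could be approximated by passing each document through `truncate_words` before a word-level model is trained; the root-word and part-of-speech variants would additionally need a morphological analyzer, which is outside this sketch.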

Original language: English
Pages (from-to): 632-645
Number of pages: 14
Journal: International Journal of Computational Intelligence Systems
Issue number: 5
Publication status: Published - Oct 2010


  • Document categorization
  • N-gram
  • Statistical language modeling
  • Turkish

