Effect of tokenization granularity for Turkish large language models

Yiğit Bekir Kaya*, A. Cüneyd Tantuğ

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Transformer-based language models such as BERT (and its optimized versions) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered self-attention-based architectures are capable of producing contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for languages with complex morphology, such as Turkish. While previous research has often focused on tokenization algorithms and has explored optimal vocabulary sizes for machine translation in English, our study extends the scope by investigating the impact of varying vocabulary sizes and exploring the feasibility of incorporating morphological tagging for Turkish. The granularity of the generated tokens is determined by various factors related to tokenization, especially the vocabulary size. This study presents a new collection of BERT models (ITUTurkBERT) trained using various tokenization methods on the BERTurk and 1 BW corpora. We fine-tuned these models for named entity recognition, sentiment analysis, and question-answering downstream tasks in Turkish and achieved state-of-the-art performance on all of these tasks. Our empirical experiments show that increasing the vocabulary size improves performance on these tasks, except for sentiment analysis, which requires further investigation.
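The link between vocabulary size and token granularity can be illustrated with a minimal sketch. It assumes WordPiece-style greedy longest-match segmentation, as in BERT's original tokenizer; the function name, the two vocabularies, and the example word are hypothetical and are not taken from the ITUTurkBERT models or their training corpora.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Non-initial pieces carry the "##" continuation prefix, following the
    convention of BERT's WordPiece vocabularies.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabularies (assumptions for illustration only).
small_vocab = {"ev", "##ler", "##imiz", "##den"}
large_vocab = small_vocab | {"evlerimiz"}

word = "evlerimizden"  # Turkish: "from our houses"
fine = wordpiece_tokenize(word, small_vocab)
coarse = wordpiece_tokenize(word, large_vocab)
print(fine)    # ['ev', '##ler', '##imiz', '##den']
print(coarse)  # ['evlerimiz', '##den']
```

With the smaller vocabulary the word splits into four pieces that happen to align with its morphemes; adding a single longer entry reduces it to two coarser tokens. Real subword vocabularies are learned from data, so their pieces need not coincide with morpheme boundaries.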

Original language: English
Article number: 200335
Journal: Intelligent Systems with Applications
Volume: 21
Publication status: Published - Mar 2024

Bibliographical note

Publisher Copyright:
© 2024 The Author(s)

Funding

We are grateful for Google's TPU Research Cloud (TRC) support in providing us with Cloud TPUs. We also thank Stefan Schweter for generously sharing the BERTurk dataset with us and enabling a fair comparison.

Funders
Google's TPU Research Cloud
The Research Council

    Keywords

    • BERT
    • Downstream tasks
    • Hyperparameter tuning
    • Tokenization
    • Turkish
