Effect of tokenization granularity for Turkish large language models

Yiğit Bekir Kaya*, A. Cüneyd Tantuğ

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Transformer-based language models such as BERT (and its optimized variants) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered self-attention-based architectures produce contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for languages with complex morphology, such as Turkish. While previous research has often focused on tokenization algorithms and on optimal vocabulary sizes for machine translation in English, our study extends the scope by investigating the impact of varying vocabulary sizes and exploring the feasibility of incorporating morphological tagging for Turkish. The granularity of the generated tokens is determined by several tokenization factors, especially the vocabulary size. This study presents a new collection of BERT models (ITUTurkBERT) trained using various tokenization methods on the BERTurk and 1 BW corpora. We fine-tuned these models for named entity recognition, sentiment analysis, and question-answering downstream tasks in Turkish and achieved state-of-the-art performance on all of them. Our empirical experiments show that increasing the vocabulary size improves performance on these tasks, except for sentiment analysis, which requires further investigation.
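To make the granularity effect concrete, the following minimal sketch (not the authors' code) trains WordPiece tokenizers with different vocabulary sizes using the Hugging Face `tokenizers` library and encodes a morphologically complex Turkish word. The corpus file name and the sample word are illustrative assumptions.

```python
# Minimal sketch: vocabulary size controls tokenization granularity.
# Assumes a plain-text Turkish corpus at "turkish_corpus.txt" (hypothetical path).
from tokenizers import BertWordPieceTokenizer

# "evlerimizden" ("from our houses") decomposes as ev+ler+imiz+den.
sample = "evlerimizden"

for vocab_size in (8_000, 32_000, 128_000):
    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(files=["turkish_corpus.txt"], vocab_size=vocab_size)
    # Smaller vocabularies split the word into many short subword pieces;
    # larger ones keep morpheme-sized or even whole-word tokens.
    print(vocab_size, tokenizer.encode(sample).tokens)
```

With a small vocabulary the word is typically fragmented into short, semantically weak pieces, while a large vocabulary tends to keep units closer to morphemes or whole words, which is the granularity dimension the study varies.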

Original language: English
Article number: 200335
Journal: Intelligent Systems with Applications
Volume: 21
Publication status: Published - Mar 2024

Bibliographical note

Publisher Copyright:
© 2024 The Author(s)

Funding

We are grateful for Google's TPU Research Cloud (TRC) support in providing us with Cloud TPUs. We also thank Stefan Schweter for generously sharing the BERTurk dataset with us and enabling a fair comparison.

Funders: Google's TPU Research Cloud; The Research Council
