Finding the Optimal Vocabulary Size for Turkish Named Entity Recognition

Yiğit Bekir Kaya, A. Cüneyd Tantuğ

Research output: Contribution to journal › Conference article › peer-review

Abstract

Transformer-based language models such as BERT [1] (and its optimized variants) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered, self-attention-based architectures produce contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for languages with complex morphology. The granularity of the generated tokens is a hyperparameter to be tuned, and it is determined by the vocabulary size. Remarkably, the effect of this hyperparameter is not widely studied, and in practice it is chosen either arbitrarily or by trial and error. Given Turkish's complex and productive morphology, the granularity hyperparameter plays a more significant role than it does for English. In this work, we present novel BERT models (named ITUTurkBERT) pretrained from scratch with various vocabulary sizes on the BERTurk corpus [2] and fine-tuned for the named entity recognition (NER) downstream task in Turkish, achieving state-of-the-art performance (average 5-fold CoNLL F1 score of 0.9372) on the WikiANN dataset [3]. Our empirical experiments demonstrate that increasing the vocabulary size leads to a higher level of token granularity, which in turn yields better NER performance.
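
To illustrate the granularity effect described in the abstract, the following minimal sketch (not taken from the paper) trains WordPiece tokenizers with different vocabulary sizes using the Hugging Face tokenizers library. The corpus file name, the vocabulary sizes, and the example word are illustrative assumptions, not the authors' exact settings.

    from tokenizers import BertWordPieceTokenizer

    # Hypothetical corpus file standing in for the BERTurk pretraining corpus.
    corpus_files = ["turkish_corpus.txt"]

    # Illustrative vocabulary sizes, not the paper's exact configurations.
    for vocab_size in (32_000, 64_000, 128_000):
        tokenizer = BertWordPieceTokenizer(lowercase=False)
        tokenizer.train(files=corpus_files, vocab_size=vocab_size)
        # With a larger vocabulary, a morphologically complex Turkish word such as
        # "evlerimizdekilerden" tends to be split into fewer, coarser subword pieces.
        print(vocab_size, tokenizer.encode("evlerimizdekilerden").tokens)

A larger vocabulary keeps more inflected word forms intact as single tokens, which is the granularity effect the abstract links to better NER performance.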

Original language: English
Pages (from-to): 99-106
Number of pages: 8
Journal: CEUR Workshop Proceedings
Volume: 3315
Publication status: Published - 2022
Event: 2022 International Conference and Workshop on Agglutinative Language Technologies as a Challenge of Natural Language Processing, ALTNLP 2022 - Virtual, Online, Slovenia
Duration: 7 Jun 2022 - 8 Jun 2022

Bibliographical note

Publisher Copyright:
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Funding

Our research is supported by Cloud TPUs from Google's TPU Research Cloud (TRC), which enabled us to achieve state-of-the-art results. We also thank Stefan Schweter for providing the BERTurk dataset, which we used to pretrain our models for a fair comparison with the original BERTurk model.

Funders:
• Google's TPU Research Cloud
• The Research Council

Keywords

• BERT
• hyperparameter tuning
• named entity recognition
• Turkish
