Finding the Optimal Vocabulary Size for Turkish Named Entity Recognition

Yiğit Bekir Kaya, A. Cüneyd Tantuğ

Research output: Contribution to journal › Conference article › peer-review

Abstract

Transformer-based language models such as BERT [1] (and its optimized variants) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered, self-attention-based architectures produce contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for morphologically complex languages. The granularity of the generated tokens is a hyperparameter to be tuned, determined by the vocabulary size. Remarkably, the effect of this hyperparameter is not widely studied; in practice, it is chosen either arbitrarily or by trial and error. Given Turkish's complex and productive morphology, tuning this granularity hyperparameter matters considerably more than it does for English. In this work, we present novel BERT models (named ITUTurkBERT) pretrained from scratch with various vocabulary sizes on the BERTurk corpus [2] and fine-tuned for the named entity recognition (NER) downstream task in Turkish, achieving state-of-the-art performance (average 5-fold CoNLL F1 score of 0.9372) on the WikiANN dataset [3]. The empirical experiments demonstrate that increasing the vocabulary size leads to higher token granularity, which in turn yields better NER performance.
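The link between vocabulary size and token granularity can be sketched with a minimal example. The snippet below is an illustration, not the paper's actual pipeline: it implements greedy longest-match-first subword segmentation (as WordPiece-style tokenizers use at inference time) and applies it to a Turkish word with two hypothetical vocabularies of different sizes. The vocabularies and the example word are invented for demonstration.

```python
def segment(word, vocab):
    """Greedily split `word` into the longest subwords found in `vocab`;
    falls back to single characters when no subword matches."""
    tokens, i = [], 0
    while i < len(word):
        j = len(word)
        while j > i and word[i:j] not in vocab:
            j -= 1
        if j == i:  # no vocabulary entry matches: emit one character
            tokens.append(word[i])
            i += 1
        else:
            tokens.append(word[i:j])
            i = j
    return tokens

# Hypothetical vocabularies: a small one holding only short morphemes,
# and a larger one that also contains longer merged subwords.
small_vocab = {"ev", "ler", "im", "iz", "den"}
large_vocab = small_vocab | {"evler", "evlerimiz", "evlerimizden"}

word = "evlerimizden"  # "from our houses"
print(segment(word, small_vocab))  # ['ev', 'ler', 'im', 'iz', 'den']
print(segment(word, large_vocab))  # ['evlerimizden']
```

With the larger vocabulary, the same word is covered by a single coarse token instead of five fine-grained morpheme pieces, which is the granularity effect the abstract describes.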

Original language: English
Pages (from-to): 99-106
Number of pages: 8
Journal: CEUR Workshop Proceedings
Volume: 3315
Publication status: Published - 2022
Event: 2022 International Conference and Workshop on Agglutinative Language Technologies as a Challenge of Natural Language Processing, ALTNLP 2022 - Virtual, Online, Slovenia
Duration: 7 Jun 2022 - 8 Jun 2022

Bibliographical note

Publisher Copyright:
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Keywords

  • BERT
  • hyperparameter tuning
  • named entity recognition
  • Turkish
