Abstract
Transformer-based language models such as BERT [1] (and its optimized variants) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered, self-attention-based architectures produce contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for languages with complex morphology. The granularity of the generated tokens is a hyperparameter to be tuned, determined by the vocabulary size. Remarkably, the effect of this hyperparameter is not widely studied; in practice it is chosen either arbitrarily or via trial and error. Given Turkish's complex and productive morphology, the granularity hyperparameter is far more important to tune than it is for English. In this work, we present novel BERT models (named ITUTurkBERT) pretrained from scratch with various vocabulary sizes on the BERTurk corpus [2] and fine-tuned for the named entity recognition (NER) downstream task in Turkish, achieving state-of-the-art performance (average 5-fold CoNLL F1 score of 0.9372) on the WikiANN dataset [3]. Our empirical experiments demonstrate that increasing the vocabulary size increases token granularity, which in turn improves NER performance.
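As a minimal illustrative sketch (not the authors' actual pretraining pipeline), the snippet below uses the Hugging Face `tokenizers` library to train WordPiece tokenizers with several vocabulary sizes and shows how the resulting token granularity changes for a morphologically complex Turkish word. The corpus file path and the chosen vocabulary sizes are placeholder assumptions, not values from the paper.

```python
# Sketch: how the vocabulary-size hyperparameter controls WordPiece granularity.
# Assumes the Hugging Face `tokenizers` library; the corpus path is hypothetical.
from tokenizers import BertWordPieceTokenizer

corpus_files = ["turkish_corpus.txt"]  # placeholder plain-text pretraining corpus

for vocab_size in (32_000, 64_000, 128_000):  # illustrative sizes, not the paper's
    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(files=corpus_files, vocab_size=vocab_size)

    # A morphologically rich Turkish word: "evlerimizden" ("from our houses").
    print(vocab_size, tokenizer.encode("evlerimizden").tokens)
    # Larger vocabularies tend to split such words into fewer, coarser pieces.
```

With a small vocabulary the word is typically broken into many subword pieces, while a sufficiently large vocabulary can keep it as one or two tokens, which is the granularity effect the abstract refers to.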
Original language | English
---|---
Pages (from-to) | 99-106
Number of pages | 8
Journal | CEUR Workshop Proceedings
Volume | 3315
Publication status | Published - 2022
Event | 2022 International Conference and Workshop on Agglutinative Language Technologies as a Challenge of Natural Language Processing, ALTNLP 2022 - Virtual, Online, Slovenia; 7 Jun 2022 → 8 Jun 2022
Bibliographical note
Publisher Copyright: © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Funding
Our research is supported by Cloud TPUs from Google's TPU Research Cloud (TRC), enabling us to achieve SotA results. We also thank Stefan Schweter for providing the BERTurk dataset, which we used to pretrain our models so that their performance could be compared fairly with the original BERTurk model.
Funders | Funder number
---|---
Google's TPU Research Cloud |
The Research Council |
Keywords
- BERT
- hyperparameter tuning
- named entity recognition
- Turkish