Preliminary investigation on using semi-supervised contextual word sense disambiguation for data augmentation

Dilara Torunoglu-Selamet, Arda Inceoglu, Gulsen Eryigit

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Citations (Scopus)

Abstract

Neural architectures have recently come to play a significant role in Word Sense Disambiguation (WSD). Supervised methods appear to be ahead of their rivals, and their performance depends mostly on the size of the training data. Numerous human-annotated datasets have been constructed for the WSD task in English. However, low-resource languages (LRLs) still face difficulty in finding suitable data resources, and gathering and annotating a sufficient amount of training data is time-consuming and labor-intensive. To address this problem, in this paper we investigate the possibility of using a semi-supervised, context-based WSD approach for data augmentation, so that the augmented data can later be used for supervised learning. Since it is difficult to find even WSD evaluation datasets for LRLs, in this study we use English datasets to build a proof of concept and to evaluate the applicability of the approach to LRLs. Our semi-supervised approach uses a seed set and context embeddings. We test 9 different context-based language models (including ELMo, BERT, and RoBERTa) and investigate their impact on WSD. In terms of accuracy, we improve on our baseline results by up to about 28 percentage points (from 50.39% for the ELMo baseline to 78.06% for the ELMo Sense Seed Based Average Similarity Model). Our initial findings indicate that the proposed approach is very promising for augmenting WSD datasets for LRLs.
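The abstract names a "Sense Seed Based Average Similarity Model" but does not spell out the algorithm. The sketch below is one plausible reading of that name, not the authors' implementation: each sense has a small seed set of context embeddings, an unlabeled occurrence is scored against each sense by its average cosine similarity to that sense's seeds, and only confident predictions are kept as augmented training examples. The function names, the confidence threshold, the sense labels, and the 3-dimensional toy vectors are all hypothetical; in practice the vectors would come from a contextual model such as ELMo or BERT.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def predict_sense(context_vec, seeds):
    """Score each sense by the mean cosine similarity to its seed vectors.

    seeds maps a sense label to a list of seed context embeddings.
    Returns the best-scoring sense and the full score table.
    """
    scores = {
        sense: float(np.mean([cosine(context_vec, s) for s in vecs]))
        for sense, vecs in seeds.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def augment(unlabeled, seeds, threshold=0.8):
    """Label unlabeled context vectors, keeping only confident predictions.

    The threshold is an assumed hyperparameter, not a value from the paper.
    Returns a list of (index, sense) pairs.
    """
    labeled = []
    for i, vec in enumerate(unlabeled):
        sense, scores = predict_sense(vec, seeds)
        if scores[sense] >= threshold:
            labeled.append((i, sense))
    return labeled


# Toy seed set for the ambiguous word "bank" (hypothetical vectors).
seeds = {
    "bank%finance": [np.array([1.0, 0.1, 0.0]), np.array([0.9, 0.2, 0.1])],
    "bank%river": [np.array([0.0, 0.1, 1.0]), np.array([0.1, 0.0, 0.9])],
}

unlabeled = [
    np.array([0.8, 0.1, 0.1]),  # close to the finance seeds
    np.array([0.5, 0.0, 0.5]),  # ambiguous, falls below the threshold
    np.array([0.1, 0.0, 0.9]),  # close to the river seeds
]

new_examples = augment(unlabeled, seeds, threshold=0.8)
print(new_examples)
```

Filtering on a threshold trades coverage for label quality: a lower value adds more automatically labeled examples to the augmented set but admits more noise, so the value would need tuning on held-out data.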

Original language: English
Title of host publication: 5th International Conference on Computer Science and Engineering, UBMK 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 337-342
Number of pages: 6
ISBN (Electronic): 9781728175652
DOIs
Publication status: Published - Sept 2020
Event: 5th International Conference on Computer Science and Engineering, UBMK 2020 - Diyarbakir, Turkey
Duration: 9 Sept 2020 to 10 Sept 2020

Publication series

Name: 5th International Conference on Computer Science and Engineering, UBMK 2020

Conference

Conference: 5th International Conference on Computer Science and Engineering, UBMK 2020
Country/Territory: Turkey
City: Diyarbakir
Period: 9/09/20 to 10/09/20

Bibliographical note

Publisher Copyright:
© 2020 IEEE.

Keywords

  • Contextual embeddings
  • Data augmentation
  • Deep learning
  • Word sense disambiguation
