
TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations

  • Bogazici University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We introduce TR-MTEB, the first large-scale, task-diverse benchmark designed to evaluate sentence embedding models for Turkish. Covering six core tasks (classification, clustering, pair classification, retrieval, bitext mining, and semantic textual similarity), TR-MTEB incorporates 26 high-quality datasets, including both native and translated resources. To complement this benchmark, we construct a corpus of 34.2 million weakly supervised Turkish sentence pairs and train two Turkish-specific embedding models using contrastive pretraining and supervised fine-tuning. Evaluation results show that our models, despite being trained with limited resources, achieve competitive performance across most tasks and significantly improve upon baseline monolingual models. All datasets, models, and evaluation pipelines are publicly released to facilitate further research in Turkish natural language processing and low-resource benchmarking.

Original language: English
Title of host publication: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Publisher: Association for Computational Linguistics (ACL)
Pages: 8867-8887
Number of pages: 21
ISBN (Electronic): 9798891763357
DOIs
Publication status: Published - 2025
Externally published: Yes
Event: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov 2025 to 9 Nov 2025

Publication series

Name: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Country/Region: China
City: Suzhou
Period: 4 Nov 2025 to 9 Nov 2025

Bibliographical note

Publisher Copyright:
©2025 Association for Computational Linguistics.
