TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Ali Bugra Kanburoglu*, Faik Boray Tek

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance against state-of-the-art models such as GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing the GPT-based models on execution accuracy. We also applied the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.
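
The abstract reports results in terms of execution accuracy. As a rough illustration of what such a metric measures, the sketch below runs a predicted query and a reference query against the same database and counts a prediction as correct when both return the same rows. This is a minimal sketch under stated assumptions, not the paper's evaluation script: the `singer` table, the sample queries, and the helper names (`run_query`, `execution_match`, `execution_accuracy`) are hypothetical, and row order is ignored for simplicity.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as a multiset, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries execute and return the same rows (order ignored)."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Hypothetical toy database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ayşe', 31), (2, 'Mehmet', 45), (3, 'Zeynep', 27);
    """
)

pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    # Both execute, but the results differ: counted as incorrect.
    ("SELECT count(*) FROM singer",
     "SELECT count(*) FROM singer WHERE age > 30"),
    # Predicted query references a missing column: counted as incorrect.
    ("SELECT nam FROM singer",
     "SELECT name FROM singer"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2f}")
```

Comparing result sets rather than query strings lets two differently written queries count as equivalent, which is why the first pair above is scored as a match.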

Original language: English
Journal: IEEE Access
Publication status: Accepted/In press - 2024

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

  • dataset
  • large language models
  • LLM
  • Text-to-SQL
  • Turkish
  • TURSpider
