LLM-as-a-Judge: automated evaluation of search query parsing using large language models

Mehmet Selman Baysan*, Serkan Uysal, İrem İşlek, Çağla Çığ Karaman, Tunga Güngör

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches. Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations. Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments. Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, supporting robust query parsing in real-world search systems.
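
The abstract does not give implementation details, so the following is only a minimal, illustrative sketch of what a Pointwise LLM-as-a-Judge call for query parsing might look like. The `call_llm` placeholder, the prompt wording, the 1-5 scale, and the pass threshold are all assumptions for illustration and are not taken from the paper.

```python
import json

# Placeholder for an LLM API call; wire this to your provider's SDK.
# (Hypothetical helper, not part of the paper's framework.)
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Connect to an actual LLM endpoint here.")

# Assumed prompt template for a pointwise judgment of one parsed query.
POINTWISE_PROMPT = """You are a strict judge of search query parsing.
Given a raw user query and the structured parse produced by a parser,
rate the parse from 1 (wrong) to 5 (fully correct and complete).
Return JSON: {{"score": <int 1-5>, "reason": "<short justification>"}}

Query: {query}
Parsed output: {parsed}
"""

def judge_pointwise(query: str, parsed: dict) -> dict:
    """Pointwise evaluation: score a single (query, parse) pair with an LLM."""
    prompt = POINTWISE_PROMPT.format(
        query=query,
        parsed=json.dumps(parsed, ensure_ascii=False),
    )
    raw = call_llm(prompt)
    verdict = json.loads(raw)              # expects {"score": ..., "reason": ...}
    verdict["passed"] = verdict["score"] >= 4  # assumed Pass/Fail cutoff
    return verdict

# Illustrative usage with made-up classified-platform data:
# judge_pointwise(
#     "2+1 apartment for rent in Istanbul under 20000",
#     {"category": "apartment", "rooms": "2+1",
#      "transaction": "rent", "city": "Istanbul", "max_price": 20000},
# )
```

Pairwise judging would follow the same pattern but present two candidate parses and ask the judge to pick the better one; agreement with human labels can then be computed over a sample of judged pairs.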

Original language: English
Article number: 1611389
Journal: Frontiers in Big Data
Volume: 8
DOIs
Publication status: Published - 2025

Bibliographical note

Publisher Copyright:
Copyright © 2025 Baysan, Uysal, İşlek, Çığ Karaman and Güngör.

Keywords

  • automatic evaluation
  • evaluation framework
  • generative search
  • large language models
  • LLM-as-a-Judge
  • query understanding
  • search query parsing
  • structured output evaluation
