Abstract
Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches.

Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations.

Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments.

Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, supporting robust query parsing in real-world search systems.
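To make the evaluation setup concrete, the sketch below illustrates the general LLM-as-a-Judge pattern the abstract describes: a routing step selects a category-specific evaluation prompt (mirroring the Contextual Evaluation Prompt Routing idea), and the judge returns a Pointwise score for a parsed query; a Pass/Fail or Pairwise variant would only change the prompt and the expected answer format. The prompt wording, category names, and the `call_llm` client are illustrative assumptions, not the authors' implementation.

```python
import json

# Hypothetical prompt templates keyed by query category. The routing idea
# follows the paper's Contextual Evaluation Prompt Routing, but these
# categories and instructions are assumptions made for illustration.
PROMPT_TEMPLATES = {
    "vehicles": "You are judging a parsed search query for a vehicle listing.",
    "real_estate": "You are judging a parsed search query for a real-estate listing.",
    "default": "You are judging a parsed search query for a classified listing.",
}

def route_prompt(query_category: str) -> str:
    """Pick the evaluation prompt that matches the query's category."""
    return PROMPT_TEMPLATES.get(query_category, PROMPT_TEMPLATES["default"])

def pointwise_judge(raw_query: str, parsed_output: dict,
                    query_category: str, call_llm) -> dict:
    """Ask an LLM to grade a single parsed query (Pointwise evaluation).

    `call_llm` is a placeholder for any chat-completion client that takes a
    prompt string and returns the model's text response.
    """
    prompt = (
        f"{route_prompt(query_category)}\n\n"
        f"Raw query: {raw_query}\n"
        f"Parsed output: {json.dumps(parsed_output, ensure_ascii=False)}\n\n"
        "Rate the parse from 1 (wrong) to 5 (perfect) and explain briefly.\n"
        'Answer as JSON: {"score": <int>, "reason": "<short explanation>"}'
    )
    return json.loads(call_llm(prompt))
```

A Pairwise judge would pass two candidate parses of the same query and ask which one is better, while a Pass/Fail judge would ask for a binary verdict; aggregating such judgments over a labeled sample is how agreement with human annotators can be measured.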
| Field | Value |
|---|---|
| Original language | English |
| Article number | 1611389 |
| Journal | Frontiers in Big Data |
| Volume | 8 |
| DOIs | |
| Publication status | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2025 Baysan, Uysal, İşlek, Çığ Karaman and Güngör.
Keywords
- automatic evaluation
- evaluation framework
- generative search
- large language models
- LLM-as-a-Judge
- query understanding
- search query parsing
- structured output evaluation