Abstract
A new benchmark called GISA is introduced for evaluating information-seeking assistants, featuring human-crafted queries with structured answer formats and live updates to prevent memorization.
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30\% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
Community
A New Benchmark for General Information Seeking Assistant
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents (2026)
- M$^3$Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning (2026)
- Table-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion (2026)
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (2026)
- Over-Searching in Search-Augmented Large Language Models (2026)
- Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models (2026)
- SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 1
Collections including this paper 0
No Collection including this paper