T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
Abstract
The Structure-of-Thought (SoT) prompting technique enhances language-model performance by guiding explicit intermediate text structuring across diverse tasks, while the T2S-Bench benchmark evaluates and improves text-to-structure capabilities with comprehensive scientific-domain coverage.
Consider how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building on this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation of 45 mainstream models reveals substantial room for improvement: average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves just 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
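The SoT recipe described above (mark key points, infer their relationships, build a structure, then answer) can be sketched as a prompt template. The wording below is an illustrative assumption, not the authors' exact prompt:

```python
def build_sot_prompt(task: str, passage: str) -> str:
    """Assemble a Structure-of-Thought style prompt.

    The three-stage instruction (key points -> relations -> structure ->
    answer) follows the paper's high-level description; the exact
    phrasing here is hypothetical.
    """
    return (
        f"Task: {task}\n\n"
        f"Passage:\n{passage}\n\n"
        "Before answering, construct an explicit text structure:\n"
        "1. List the key points in the passage.\n"
        "2. Infer the relationships between those key points\n"
        "   (e.g. cause-effect, part-whole, temporal order).\n"
        "3. Organize them into a structure (outline, tree, or graph).\n"
        "Then use that structure to produce your final answer.\n"
    )

# The resulting string can be sent to any chat model as the user message.
prompt = build_sot_prompt(
    "Summarize the passage in one sentence.",
    "Explicit structuring of text can improve LLM reasoning.",
)
```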
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark (2026)
- RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension (2026)
- ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios (2026)
- N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs (2026)
- UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop (2026)
- CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning (2026)
- From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs (2026)