MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
Abstract
MuSR, a neurosymbolically generated dataset for evaluating multistep reasoning in language models, highlights challenges and gaps in techniques like chain-of-thought prompting.
While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.
Community
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction (2025)
- BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models (2025)
- From What to Why: Thought-Space Recommendation with Small Language Models (2025)
- Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective (2025)
- Roundtable Policy: Improving Scientific Reasoning and Narratives through Confidence-Weighted Consensus of LLMs (2025)
- StyleBench: Evaluating thinking styles in Large Language Models (2025)
- Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework (2025)
You can ask Librarian Bot directly for paper recommendations by tagging it in a comment: `@librarian-bot recommend`