Reasoning
5 tasks covering logical deduction, math word problems, causal reasoning, and spatial puzzles.
What It Tests
The Reasoning family tests an agent's ability to solve problems through logical deduction and multi-step reasoning. Tasks include classic logic puzzles (constraint satisfaction), math word problems requiring algebraic reasoning, causal/counterfactual analysis (distinguishing correlation from causation), and spatial navigation (tracking position and direction through a series of movements).
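To illustrate the constraint-satisfaction style of puzzle, here is a minimal brute-force sketch for a three-person, three-pet puzzle in the style of reasoning-001. The two clues in `CONSTRAINTS` are hypothetical, invented only for this example; the task's actual clues are not listed on this page. The function name `solve` is likewise an assumption.

```python
from itertools import permutations

PEOPLE = ("Alice", "Bob", "Carol")
PETS = ("cat", "dog", "fish")

# HYPOTHETICAL clues, made up for illustration (not the real task's constraints).
CONSTRAINTS = [
    lambda a: a["Alice"] not in ("cat", "fish"),  # "Alice owns neither the cat nor the fish"
    lambda a: a["Carol"] != "cat",                # "Carol does not own the cat"
]

def solve():
    """Try every one-to-one assignment of pets to people; keep those meeting all clues."""
    solutions = []
    for perm in permutations(PETS):
        assignment = dict(zip(PEOPLE, perm))
        if all(check(assignment) for check in CONSTRAINTS):
            solutions.append(assignment)
    return solutions

solutions = solve()
print(solutions)  # [{'Alice': 'dog', 'Bob': 'cat', 'Carol': 'fish'}]
```

With only 3! = 6 candidate assignments, exhaustive search is enough; a well-posed puzzle should leave exactly one solution.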
How It's Scored
Most tasks are scored automatically by verifying JSON output: the agent must return structured JSON whose values match the expected values exactly. The causal reasoning task (reasoning-004) is instead scored against a rubric that evaluates the quality of its explanations and its identification of logical fallacies.
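A minimal sketch of what exact-match JSON verification might look like. The function name `score_json_output` and the pass/fail return value are assumptions made for illustration; the benchmark's actual harness is not shown on this page and may differ (for example in how it handles extra text around the JSON).

```python
import json

def score_json_output(raw_output: str, expected: dict) -> bool:
    """Return True only if the output parses as JSON and equals `expected`."""
    try:
        actual = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # output that is not valid JSON fails outright
    # Dict equality ignores key order but requires the same keys with equal values.
    return actual == expected

expected = {"x": -2, "y": 4, "facing": "north"}
print(score_json_output('{"facing": "north", "x": -2, "y": 4}', expected))  # True
print(score_json_output('{"x": -2, "y": 4, "facing": "North"}', expected))  # False
```

Under this sketch, key order does not matter, but a value that differs in any way, including letter case, counts as a failure.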
All Tasks (5)
Complete list of tasks in this benchmark family with evaluation criteria.
reasoning-001 | Medium | Basic logical deduction
Alice, Bob, and Carol each have a different pet (cat, dog, fish). Given constraints, determine who has which pet.
JSON assertion: {"Alice": "dog", "Bob": "cat", "Carol": "fish"}
reasoning-002 | Medium | Multi-step math word problem
Tom buys 4 apples at $2 each and some oranges at $3 each, paying $20 total. How many oranges?
Assert answer is 4
reasoning-003 | Medium | Multi-step word problem
Two-day bakery revenue problem: Monday sales with cupcakes and cookies, Tuesday with modified quantities and prices.
JSON assertion: {"tuesday_revenue": 145}
reasoning-004 | Hard | Causal reasoning and counterfactual
Speed cameras are installed on Road A, where accidents then fall by 30%, while accidents on Road B rise by 20%. Evaluate 4 causal claims.
Criteria: identifies correlation≠causation, displacement effect, counterfactual reasoning, magnitude uncertainty
reasoning-005 | Medium | Spatial reasoning puzzle
Starting north-facing, execute a series of turns and steps. Determine final position (x,y) and facing direction.
JSON assertion: {"x": -2, "y": 4, "facing": "north"}
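The position tracking that the spatial task (reasoning-005) asks for can be sketched as a small simulator. Everything here is an assumption made for illustration: the function name `run`, the move encoding (an integer means "step forward that many units", the strings "left" and "right" mean 90-degree turns), the origin start, and the convention that x grows eastward and y grows northward. The move list is made up and is not the task's actual input; it was chosen only so that it ends in the expected final state.

```python
# Compass headings in clockwise order, so a right turn is index + 1.
HEADINGS = ["north", "east", "south", "west"]
# Assumed axes: +y is north, +x is east.
DELTAS = {"north": (0, 1), "east": (1, 0), "south": (0, -1), "west": (-1, 0)}

def run(moves, start=(0, 0), facing="north"):
    """Apply turns and forward steps in order; return the final state."""
    x, y = start
    h = HEADINGS.index(facing)
    for move in moves:
        if move == "left":
            h = (h - 1) % 4          # rotate 90 degrees counter-clockwise
        elif move == "right":
            h = (h + 1) % 4          # rotate 90 degrees clockwise
        else:
            dx, dy = DELTAS[HEADINGS[h]]
            x, y = x + dx * move, y + dy * move  # integer = steps forward
    return {"x": x, "y": y, "facing": HEADINGS[h]}

# HYPOTHETICAL move list, invented for this example.
moves = [4, "left", 2, "right"]
print(run(moves))  # {'x': -2, 'y': 4, 'facing': 'north'}
```

Keeping the heading as an index into a clockwise list reduces each turn to modular arithmetic, which avoids a separate lookup table for every (heading, turn) pair.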