Coding
10 tasks covering bug fixing, algorithm implementation, data structure design, and concurrency handling.
What It Tests
The Coding family tests an agent's ability to write correct, efficient code that handles edge cases. Tasks range from simple bug fixes (empty list handling, string reversal) to complex data structure design (LRU cache) and concurrency (async race conditions). All code is executed in a sandboxed subprocess with a 10-second timeout and 256 MB memory limit.
How It's Scored
Tasks are scored automatically through sandboxed code execution. Python code is extracted from the markdown fenced blocks in the agent's response and executed together with unit tests that verify correctness. A task passes only if all assertions succeed.
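The extraction and execution steps described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the names `extract_code` and `passes` are hypothetical, the regular expression only recognizes fences tagged `python`, `py`, or untagged, and the 256 MB memory limit is not enforced here.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Join the bodies of all fenced code blocks found in a response."""
    return "\n".join(body.strip("\n") for body in FENCE_RE.findall(response))


def passes(response: str, unit_test: str, timeout: float = 10.0) -> bool:
    """Run the extracted code plus the unit test in a child interpreter.

    Assumption: a zero exit code means every assertion succeeded.
    """
    program = extract_code(response) + "\n" + unit_test + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A plain child process is not a real sandbox. Production harnesses typically add memory limits and filesystem and network isolation on top of the timeout.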
All Tasks (10)
Complete list of tasks in this benchmark family with evaluation criteria.
coding-001 | Easy | Fix IndexError on empty list
Fix the bug in this Python function that causes an IndexError when the input list is empty.
Unit test: assert get_first([]) is None and get_first([1,2]) == 1
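The buggy function itself is not shown on this page. Assuming it indexed `items[0]` unconditionally, one possible fix that satisfies the unit test above is:

```python
def get_first(items):
    """Return the first element, or None when the list is empty."""
    # An empty list is falsy, so the index is only evaluated when it is safe.
    return items[0] if items else None
```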
coding-002 | Easy | Reverse a string without built-in reverse
Write a Python function that reverses a string without using the built-in reversed() function or slice notation [::-1].
Unit test: assert reverse_string('hello') == 'olleh' and reverse_string('') == ''
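One way to meet the constraint is to walk the string backwards by index and join the collected characters. This is a sketch of one acceptable answer, not the only one:

```python
def reverse_string(s: str) -> str:
    """Reverse s without reversed() or [::-1]."""
    chars = []
    # Iterate from the last index down to 0.
    for i in range(len(s) - 1, -1, -1):
        chars.append(s[i])
    return "".join(chars)
```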
coding-003 | Medium | Flatten a nested list
Write a Python function that flattens a nested list of arbitrary depth into a single flat list.
Unit test: assert flatten([1, [2, [3, 4], 5], 6]) == [1, 2, 3, 4, 5, 6]
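A possible solution uses an explicit stack of iterators instead of recursion, so very deep nesting does not hit Python's recursion limit. The sketch assumes that only `list` instances count as nested containers.

```python
def flatten(nested):
    """Flatten a list nested to arbitrary depth into one flat list."""
    flat = []
    stack = [iter(nested)]
    while stack:
        for item in stack[-1]:
            if isinstance(item, list):
                # Descend into the sublist; the parent iterator resumes later.
                stack.append(iter(item))
                break
            flat.append(item)
        else:
            # The current iterator is exhausted; return to its parent.
            stack.pop()
    return flat
```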
coding-004 | Easy | Two Sum problem
Write a Python function two_sum(nums, target) that returns the indices of two numbers that add up to target.
Unit test: assert two_sum([2,7,11,15], 9) == [0,1]
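A common single-pass approach stores each value's index in a dictionary and looks up the complement as it goes. The task statement does not say what to return when no pair exists; returning `None` in that case is an assumption of this sketch.

```python
def two_sum(nums, target):
    """Return indices [i, j] with i < j and nums[i] + nums[j] == target."""
    seen = {}  # value -> index where it was first seen
    for i, n in enumerate(nums):
        j = seen.get(target - n)
        if j is not None:
            return [j, i]
        seen.setdefault(n, i)
    return None  # assumption: no matching pair
```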
coding-005 | Medium | Efficient Fibonacci
Write a Python function fib(n) that returns the nth Fibonacci number. Must handle n=50 efficiently (no exponential time).
Unit test: assert fib(0) == 0 and fib(50) == 12586269025
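An iterative version runs in linear time and constant space, which is more than enough for n=50. The sketch below assumes `n` is a non-negative integer.

```python
def fib(n: int) -> int:
    """Return the nth Fibonacci number, with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        # Slide the pair (F(k), F(k+1)) forward one step.
        a, b = b, a + b
    return a
```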
coding-006 | Medium | Implement LRU cache
Implement a class LRUCache with get(key) and put(key, value). Both must run in O(1) average time.
Unit test: verifies eviction of the least recently used key and O(1) operations
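One compact way to get O(1) average `get` and `put` is `collections.OrderedDict`, which keeps keys in recency order. The task statement does not specify the constructor or the miss value; the `capacity` argument and the `-1` returned on a miss are assumptions borrowed from the common formulation of this problem.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity  # assumption: fixed positive capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return -1  # assumption: sentinel for a cache miss
        # Mark the key as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            # Evict the least recently used entry (the oldest one).
            self._data.popitem(last=False)
```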
coding-007 | Medium | Parse CSV without libraries
Write a Python function parse_csv(text) that handles quoted fields with commas, newlines, and escaped double quotes.
Unit test: verifies comma-in-quotes, escaped quotes, embedded newlines
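A character-by-character state machine handles all three cases. The sketch below assumes the function returns a list of rows, each a list of strings, and that a doubled quote (`""`) inside a quoted field stands for one literal quote. The page does not state the expected return shape.

```python
def parse_csv(text):
    """Parse CSV text into a list of rows (lists of strings)."""
    rows, row, field = [], [], []
    in_quotes = False
    started = False  # True once the current field has any content or a quote
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # escaped quote inside a quoted field
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and newlines are literal here
        elif ch == '"':
            in_quotes = True
            started = True
        elif ch == ",":
            row.append("".join(field))
            field, started = [], False
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as one line break
            row.append("".join(field))
            rows.append(row)
            row, field, started = [], [], False
        else:
            field.append(ch)
            started = True
        i += 1
    if started or field or row:
        # Flush the last record when the text lacks a trailing newline.
        row.append("".join(field))
        rows.append(row)
    return rows
```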
coding-008 | Hard | Longest common subsequence
Write a Python function lcs(a, b) that returns the longest common subsequence. Must run in O(n*m) time.
Unit test: verifies LCS length and that result is a valid subsequence of both inputs
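The standard dynamic-programming table meets the O(n*m) bound, and a backward walk through the table recovers one subsequence. The sketch assumes string inputs and a string result. When several longest subsequences exist, it returns one of them.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])

    # Walk back from the bottom-right corner to rebuild the subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```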
coding-009 | Hard | Sliding window rate limiter
Implement a RateLimiter class that allows at most max_requests calls within any sliding window of window_seconds.
Unit test: verifies rate limiting, expiry, and cleanup of old entries
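A deque of call timestamps is one straightforward way to implement the sliding window. The task statement names only the class and its two parameters, so the `allow()` method and the injectable `clock` argument below are hypothetical choices made for this sketch. It also treats a call that is exactly `window_seconds` old as expired.

```python
import time
from collections import deque


class RateLimiter:
    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock    # hypothetical hook so tests can fake time
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        """Return True and record the call if it fits in the window."""
        now = self._clock()
        cutoff = now - self.window_seconds
        # Drop timestamps that have left the window (cleanup of old entries).
        while self._calls and self._calls[0] <= cutoff:
            self._calls.popleft()
        if len(self._calls) < self.max_requests:
            self._calls.append(now)
            return True
        return False
```

Passing a fake clock, for example `RateLimiter(2, 10, clock=lambda: t[0])`, lets a test advance time without sleeping.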
coding-010 | Hard | Fix race condition in async code
Fix the race condition in an async transfer() function where concurrent transfers can cause lost updates.
Criteria: uses asyncio.Lock, lock held through read-check-write, total balance preserved, correct return values
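The original `transfer()` is not reproduced on this page, so the signature below (a dict of balances plus an explicit lock) is a hypothetical stand-in. The sketch shows the pattern the criteria describe: the lock is held across the whole read-check-write sequence, including the await point where another task could otherwise interleave.

```python
import asyncio


async def transfer(accounts, lock, src, dst, amount) -> bool:
    """Move amount from src to dst; return False on insufficient funds."""
    async with lock:
        balance = accounts[src]  # read
        if balance < amount:     # check
            return False
        # Simulated I/O: without the lock, other transfers could run here
        # and later be overwritten by the stale balance read above.
        await asyncio.sleep(0)
        accounts[src] = balance - amount  # write
        accounts[dst] += amount
        return True


async def demo():
    accounts = {"a": 100, "b": 0}
    lock = asyncio.Lock()
    results = await asyncio.gather(
        *(transfer(accounts, lock, "a", "b", 30) for _ in range(5))
    )
    return accounts, results
```

Running `asyncio.run(demo())` starts five concurrent transfers of 30 from a balance of 100. With the lock in place, the total balance is preserved and only the transfers that can be funded return `True`.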