Analysis updated 2026-05-18
Measure what fraction of an AI coding agent's token spend went to dead-end attempts that contributed nothing to the final answer.
Detect runs where an agent passed its own internal checks but failed an external test evaluator, revealing confidently wrong work.
Generate an HTML report with a branch tree and cost breakdown to understand where an agent wastes the most compute.
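The second use case above, spotting runs that pass the agent's own checks but fail an external evaluator, can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: the `RunResult` record, its field names, and the `is_confidently_wrong` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    # All fields are hypothetical; the tool's real run record may differ.
    run_id: str
    internal_checks_passed: bool
    external_eval_passed: bool
    dead_branches: int


def is_confidently_wrong(run):
    """True when the agent's own checks passed but the external evaluator failed."""
    return run.internal_checks_passed and not run.external_eval_passed


runs = [
    RunResult("run-a", True, True, 2),
    RunResult("run-b", True, False, 0),   # zero dead branches, still fails externally
    RunResult("run-c", False, False, 3),
]

flagged = [r.run_id for r in runs if is_confidently_wrong(r)]
print(flagged)
```

Note that `run-b` has no dead branches at all, so a waste metric alone would not catch it. That is why the external result is tracked as a separate signal.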
| | wisoba/deadbranchbench | 0xhassaan/nn-from-scratch | a-little-hoof/dsr |
|---|---|---|---|
| Stars | 0 | 0 | 0 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | hard |
| Complexity | 3/5 | 4/5 | 5/5 |
| Audience | researcher | developer | researcher |
Figures from each repo's GitHub metadata at analysis time.
Week 1 scope: event schema and CLI only. No pruning or agent intelligence built in yet.
When an AI agent works on a coding task, it tries multiple approaches. Some attempts succeed and contribute to the final result; others fail, get discarded, or turn out to have been unnecessary. DeadBranchBench is a Python tool for measuring what that wasted effort costs in tokens, tool calls, retries, and computation time.
The main concept is a "dead branch": a path the agent took that contributed nothing to the final result. The tool records what an agent does as a series of events, organizes them into a tree of branches, and lets a human reviewer label each branch as live (contributed to the solution), support (failed but produced useful information), deferred (preserved for possible future use), or dead (consumed cost with no measurable output). From those labels, the tool computes metrics such as the Dead Branch Ratio, the fraction of total cost that went to dead work.
The tool also handles a subtler failure mode: an agent that succeeded by its own internal checks but still failed an external evaluator. A run can have zero dead branches by its own accounting and yet fail a test suite. This distinction matters when benchmarking agents on real tasks, where finishing a run is not the same as producing a correct result.
In practice you run the tool from the command line. You observe a command or script, capture events as a JSONL file, build a trace skeleton, label the branches interactively, compute the metrics, and optionally export an HTML report showing the branch tree, cost breakdown, and top waste contributors. The tool does not prune branches or optimize the agent; it only measures and reports. This is an early-stage project aimed at researchers and developers who build or evaluate AI coding agents and want objective data on where compute is being wasted.
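To make the Dead Branch Ratio concrete, here is a minimal sketch of how it could be computed from JSONL events and reviewer labels. The event fields (`branch_id`, `tokens`), the sample data, and both helper functions are assumptions for illustration. The tool's actual schema and metric code may differ, and this sketch counts only token cost.

```python
import json
from collections import defaultdict

# Hypothetical JSONL events: one JSON object per line. The field names
# (branch_id, tokens) are assumptions, not the tool's real schema.
EVENTS_JSONL = """\
{"branch_id": "b1", "tokens": 1200}
{"branch_id": "b2", "tokens": 800}
{"branch_id": "b2", "tokens": 400}
{"branch_id": "b3", "tokens": 600}
{"branch_id": "b4", "tokens": 1000}
"""

# Reviewer-assigned labels, using the four categories described above.
LABELS = {"b1": "live", "b2": "dead", "b3": "support", "b4": "deferred"}


def cost_per_branch(jsonl_text):
    """Sum the token cost of each branch across all of its events."""
    totals = defaultdict(int)
    for line in jsonl_text.splitlines():
        if line.strip():
            event = json.loads(line)
            totals[event["branch_id"]] += event["tokens"]
    return dict(totals)


def dead_branch_ratio(costs, labels):
    """Fraction of total cost spent on branches labeled dead."""
    total = sum(costs.values())
    if total == 0:
        return 0.0  # no recorded cost, so nothing was wasted
    dead = sum(cost for branch, cost in costs.items()
               if labels.get(branch) == "dead")
    return dead / total


costs = cost_per_branch(EVENTS_JSONL)
ratio = dead_branch_ratio(costs, LABELS)
print(f"Dead Branch Ratio: {ratio:.2f}")  # 1200 of 4000 tokens -> 0.30
```

In this sketch only branches labeled dead count as waste. Support and deferred branches are treated as non-dead, matching the label definitions above, since they produced information or may still be reused.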
A benchmarking tool that records AI agent work as events, labels branches as live or dead after review, and measures how much compute cost went to wasted effort.
Mainly Python. The stack also includes JSONL and a CLI.
No license information is mentioned in the README.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Audience: mainly researchers.
Verify against the repo before relying on details.