Find a representative benchmark for a tool-using or web-browsing agent project
Cite the trajectory-based evaluation taxonomy in a research paper or grant
Compare which benchmarks score final outcomes versus full trajectories
Use the five design principles as a checklist when proposing a new agent benchmark
No code to install; the repo is a README with a benchmark catalogue and a link to the arXiv paper.
This repository accompanies a research paper titled "Interactive Evaluation Requires a Design Science". It is not software you install: it is a position paper plus a curated list of benchmarks, and the README lays out the argument and the supporting material. The authors include researchers from several universities, and the paper is hosted on arXiv.
The main claim is that AI evaluation needs a more careful way of measuring systems that act over time, not just systems that produce a single answer. When a model uses tools, browses the web, talks to a user, or coordinates with other agents, earlier actions change what later evidence looks like. The authors argue that scoring only the final output misses most of what is happening, so evaluation should treat the whole trajectory of actions and observations as the evidence to be judged.
To organise this, the README proposes a two-axis taxonomy. One axis covers what the system interacts with: tools and environments, human users, other agents, or hybrid setups with persistent state across sessions. The other axis covers what the evaluation is actually trying to measure: final task success, process quality and efficiency, recoverability under errors, and safety and social behaviour. The authors say current benchmarks mostly record full trajectories but still score only the final outcome, which leaves a gap between the evidence collected and the evidence judged.
The paper sets out five design principles for building interactive evaluations. These include clearly specifying the system and the trajectory evidence, documenting the interaction protocol like a dataset, designing tests that perturb the system and check whether it can repair itself, reporting outcome, process, and risk separately, and building shared infrastructure without locking the field into one fixed design.
The rest of the README is a curated catalogue of 55 benchmarks grouped into three stages.
Stage 1 covers response-centered tests such as SQuAD, MMLU, GSM8K, HumanEval, and Chatbot Arena. Stage 2 covers task-driven benchmarks such as SWE-Bench, GAIA, ToolBench, and TravelPlanner. Stage 3 continues with more recent agentic and interactive benchmarks. Each entry links to its paper, and the list is kept up to date.
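As a rough illustration of the two-axis taxonomy and of reporting outcome, process, and risk separately, the idea can be sketched in Python. This is not code from the repository, which ships none. Every name here (InteractionTarget, MeasurementGoal, Step, Report, score_trajectory) and the simple step-fraction metrics are assumptions made for the example, not definitions taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class InteractionTarget(Enum):
    """Axis 1 of the taxonomy: what the system interacts with."""
    TOOLS_AND_ENVIRONMENTS = "tools and environments"
    HUMAN_USERS = "human users"
    OTHER_AGENTS = "other agents"
    HYBRID_PERSISTENT = "hybrid setups with persistent state"


class MeasurementGoal(Enum):
    """Axis 2 of the taxonomy: what the evaluation tries to measure."""
    FINAL_TASK_SUCCESS = "final task success"
    PROCESS_QUALITY = "process quality and efficiency"
    RECOVERABILITY = "recoverability under errors"
    SAFETY_AND_SOCIAL = "safety and social behaviour"


@dataclass(frozen=True)
class Step:
    """One action/observation pair in a trajectory (hypothetical schema)."""
    action: str
    observation: str
    is_error: bool = False   # assumed flag: the step failed
    is_unsafe: bool = False  # assumed flag: the step broke a safety rule


@dataclass(frozen=True)
class Report:
    """Outcome, process, and risk kept as separate numbers, not one score."""
    outcome: float  # 1.0 if the task succeeded, else 0.0
    process: float  # fraction of steps that were not errors
    risk: float     # fraction of steps flagged unsafe


def score_trajectory(steps: list[Step], task_succeeded: bool) -> Report:
    """Score the whole trajectory rather than only the final answer."""
    if not steps:
        raise ValueError("a trajectory needs at least one step")
    n = len(steps)
    errors = sum(s.is_error for s in steps)
    unsafe = sum(s.is_unsafe for s in steps)
    return Report(
        outcome=1.0 if task_succeeded else 0.0,
        process=(n - errors) / n,
        risk=unsafe / n,
    )


# A made-up web-browsing trajectory with one failed step that the agent recovers from.
trajectory = [
    Step("search('flights')", "results page"),
    Step("click('#missing')", "element not found", is_error=True),
    Step("click('#book')", "booking form"),
    Step("submit()", "confirmation"),
]
report = score_trajectory(trajectory, task_succeeded=True)
print(report)
```

An outcome-only scorer would rate this run as a plain success. Keeping the three fields apart shows the failed step in the process score without changing the outcome, which is the kind of gap the authors describe.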
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.