explaingit

keyangds/interactive_evaluation

12Audience · researcherComplexity · 1/5ActiveLicenseSetup · easy

TLDR

Companion repo for an arXiv position paper that argues AI evaluation needs trajectory-level methods, plus a curated catalogue of 55 benchmarks across response-centered, task-driven, and interactive stages.

Mindmap

mindmap
  root((interactive_evaluation))
    Inputs
      Position paper arXiv 2605.17829
      Curated benchmark list
      Two axis taxonomy
    Outputs
      Five design principles
      Stage 1 2 3 benchmark tables
      Citation bibtex
    Use Cases
      Reading the framework
      Picking benchmarks for agents
      Comparing trajectory metrics
    Tech Stack
      Markdown
      Arxiv
      Bibtex

Things people build with this

USE CASE 1

Find a representative benchmark for a tool-using or web-browsing agent project

USE CASE 2

Cite the trajectory-based evaluation taxonomy in a research paper or grant

USE CASE 3

Compare which benchmarks score final outcomes versus full trajectories

USE CASE 4

Use the five design principles as a checklist when proposing a new agent benchmark

Tech stack

MarkdownArxivBibtex

Getting it running

Difficulty · easy Time to first run · 5min

No code to install; the repo is a README with a benchmark catalogue and a link to the arXiv paper.

MIT license allowing free use, modification, and distribution with attribution.

In plain English

This repository accompanies a research paper called Interactive Evaluation Requires a Design Science. It is not a piece of software you install. It is a position paper plus a curated list of benchmarks, and the README lays out the argument and the supporting material. The authors include researchers from several universities, and the paper is hosted on arXiv. The main claim is that the field of AI evaluation needs a more careful way of measuring systems that act over time, not just systems that produce a single answer. When a model uses tools, browses the web, talks to a user, or coordinates with other agents, earlier actions change what later evidence looks like. The authors argue that simply scoring the final output misses most of what is happening, so evaluation should treat the whole trajectory of actions and observations as the evidence to be judged. To organise this, the README proposes a two axis taxonomy. One axis covers what the system interacts with, such as tools and environments, human users, other agents, or hybrid setups with persistent state across sessions. The other axis covers what the evaluation is actually trying to measure: final task success, process quality and efficiency, recoverability under errors, and safety and social behaviour. The authors say current benchmarks mostly record full trajectories but still only score the final outcome, which leaves a gap. The paper sets out five design principles for building interactive evaluations. These include clearly specifying the system and the trajectory evidence, documenting the interaction protocol like a dataset, designing tests that perturb the system and check whether it can repair itself, reporting outcome and process and risk separately, and building shared infrastructure without locking the field into one fixed design. The rest of the README is a curated catalogue of 55 benchmarks grouped into three stages. Stage 1 covers response centered tests like SQuAD, MMLU, GSM8K, HumanEval, and Chatbot Arena. Stage 2 covers task driven benchmarks like SWE-Bench, GAIA, ToolBench, and TravelPlanner. A Stage 3 list continues with more recent agentic and interactive benchmarks. Each entry links to its paper, and the list is kept up to date.

Copy-paste prompts

Prompt 1
Summarise the two axis taxonomy from interactive_evaluation so I can apply it to my multi agent web browsing project.
Prompt 2
From the Stage 2 table in interactive_evaluation, list the tool use benchmarks and what each one actually measures.
Prompt 3
Translate the five design principles from interactive_evaluation into a checklist I can use to review a new agent benchmark proposal.
Prompt 4
Help me draft a related work paragraph that cites interactive_evaluation alongside SWE-Bench, GAIA, and TravelPlanner.
Open on GitHub → Explain another repo

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.