keyangds/interactive_evaluation

Analysis updated 2026-06-24

★ 12 | Audience · researcher | Complexity · 1/5 | License | Setup · easy

Mindmap

mindmap
  root((interactive_evaluation))
    Inputs
      Position paper arXiv 2605.17829
      Curated benchmark list
      Two axis taxonomy
    Outputs
      Five design principles
      Stage 1 2 3 benchmark tables
      Citation bibtex
    Use Cases
      Reading the framework
      Picking benchmarks for agents
      Comparing trajectory metrics
    Tech Stack
      Markdown
      Arxiv
      Bibtex


What do people build with it?

USE CASE 1

Find a representative benchmark for a tool-using or web-browsing agent project

USE CASE 2

Cite the trajectory-based evaluation taxonomy in a research paper or grant

USE CASE 3

Compare which benchmarks score final outcomes versus full trajectories

USE CASE 4

Use the five design principles as a checklist when proposing a new agent benchmark

What is it built with?

Markdown · arXiv · BibTeX

How does it compare?

	keyangds/interactive_evaluation	89171/web3-101	abiodundotdo/termframe
Stars	12	12	12
Language	—	—	Shell
Setup difficulty	easy	easy	easy
Complexity	1/5	1/5	2/5
Audience	researcher	general	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

There is no code to install. The repo is a README with a benchmark catalogue and a link to the arXiv paper.

Licensed under MIT, which allows free use, modification, and distribution with attribution.

In plain English

This repository accompanies a research paper called "Interactive Evaluation Requires a Design Science". It is not a piece of software you install. It is a position paper plus a curated list of benchmarks, and the README lays out the argument and the supporting material. The authors include researchers from several universities, and the paper is hosted on arXiv.

The main claim is that the field of AI evaluation needs a more careful way of measuring systems that act over time, not just systems that produce a single answer. When a model uses tools, browses the web, talks to a user, or coordinates with other agents, earlier actions change what later evidence looks like. The authors argue that scoring only the final output misses most of what is happening, so evaluation should treat the whole trajectory of actions and observations as the evidence to be judged.

To organise this, the README proposes a two-axis taxonomy. One axis covers what the system interacts with: tools and environments, human users, other agents, or hybrid setups with persistent state across sessions. The other axis covers what the evaluation is actually trying to measure: final task success, process quality and efficiency, recoverability under errors, and safety and social behaviour. The authors say current benchmarks mostly record full trajectories but still score only the final outcome, which leaves a gap.

The paper sets out five design principles for building interactive evaluations: clearly specifying the system and the trajectory evidence, documenting the interaction protocol like a dataset, designing tests that perturb the system and check whether it can repair itself, reporting outcome, process, and risk separately, and building shared infrastructure without locking the field into one fixed design.

The rest of the README is a curated catalogue of 55 benchmarks grouped into three stages. Stage 1 covers response-centered tests like SQuAD, MMLU, GSM8K, HumanEval, and Chatbot Arena. Stage 2 covers task-driven benchmarks like SWE-Bench, GAIA, ToolBench, and TravelPlanner. Stage 3 continues with more recent agentic and interactive benchmarks. Each entry links to its paper, and the list is kept up to date.
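
The two-axis taxonomy described above can be pictured as a small data model. The sketch below is illustrative only and does not come from the repo, which ships no code: the class names, the `BenchmarkProfile` record, and the `hypothetical-web-bench` entry are all assumptions made for this example. It tags a benchmark on both axes and flags the gap the authors describe, where trajectories are recorded but only the final outcome is scored.

```python
from dataclasses import dataclass
from enum import Enum


class InteractsWith(Enum):
    """Axis 1 (names are illustrative): what the system interacts with."""
    TOOLS_AND_ENVIRONMENTS = "tools and environments"
    HUMAN_USERS = "human users"
    OTHER_AGENTS = "other agents"
    HYBRID_PERSISTENT = "hybrid setups with persistent state"


class Measures(Enum):
    """Axis 2 (names are illustrative): what the evaluation tries to measure."""
    FINAL_TASK_SUCCESS = "final task success"
    PROCESS_QUALITY = "process quality and efficiency"
    RECOVERABILITY = "recoverability under errors"
    SAFETY_AND_SOCIAL = "safety and social behaviour"


@dataclass(frozen=True)
class BenchmarkProfile:
    """Hypothetical record placing one benchmark on both axes."""
    name: str
    interacts_with: frozenset[InteractsWith]
    scored: frozenset[Measures]
    records_trajectory: bool

    def has_scoring_gap(self) -> bool:
        # The gap: trajectories are logged, yet only the final outcome is scored.
        outcome_only = self.scored == frozenset({Measures.FINAL_TASK_SUCCESS})
        return self.records_trajectory and outcome_only


def unscored_targets(profile: BenchmarkProfile) -> list[Measures]:
    """Measurement targets the benchmark does not score, in definition order."""
    return [m for m in Measures if m not in profile.scored]


# Made-up benchmark entry, not one from the repo's catalogue.
web_bench = BenchmarkProfile(
    name="hypothetical-web-bench",
    interacts_with=frozenset({InteractsWith.TOOLS_AND_ENVIRONMENTS}),
    scored=frozenset({Measures.FINAL_TASK_SUCCESS}),
    records_trajectory=True,
)

print(web_bench.has_scoring_gap())
print([m.value for m in unscored_targets(web_bench)])
```

Keeping the two axes as separate enums mirrors the idea that what a system interacts with and what an evaluation measures are independent choices. A real catalogue entry would need to be tagged by reading the benchmark's own paper.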

Copy-paste prompts

Prompt 1

Summarise the two axis taxonomy from interactive_evaluation so I can apply it to my multi agent web browsing project.

Prompt 2

From the Stage 2 table in interactive_evaluation, list the tool use benchmarks and what each one actually measures.

Prompt 3

Translate the five design principles from interactive_evaluation into a checklist I can use to review a new agent benchmark proposal.

Prompt 4

Help me draft a related work paragraph that cites interactive_evaluation alongside SWE-Bench, GAIA, and TravelPlanner.

Frequently asked questions

What is interactive_evaluation?

It is the companion repo for an arXiv position paper that argues AI evaluation needs trajectory-level methods, plus a curated catalogue of 55 benchmarks across response-centered, task-driven, and interactive stages.

What license does interactive_evaluation use?

It uses the MIT license, which allows free use, modification, and distribution with attribution.

How hard is interactive_evaluation to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is interactive_evaluation for?

Mainly researchers.


Verify against the repo before relying on details.