tianshi-xu/life-harness

★ 20 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((life-harness))
    What it does
      Runtime interface adaptation
      No model retraining
      Reuses recovery patterns
    Benchmarks covered
      Web shopping
      Database querying
      OS interaction
      Household navigation
    Tech Stack
      Python
      Docker
      uv package manager
    Use Cases
      AI agent research
      Benchmark reproduction
      Agent failure analysis

Things people build with this

USE CASE 1

Improve a frozen AI model's performance on tasks like web shopping or database querying without retraining it, just by adapting the runtime interface layer.

USE CASE 2

Reproduce the Life-Harness arXiv 2026 paper results by running AgentBench or tau-bench tasks with your own AI model API keys.

USE CASE 3

Study how storing and reusing successful agent recovery patterns from past runs prevents models from repeating the same failures.

USE CASE 4

Test how different AI model backbones respond to runtime adaptations across seven benchmark task families.

Tech stack

Python, Docker, uv

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires Docker for AgentBench tasks and the uv Python environment manager for tau-bench tasks, plus your own API keys for each AI model backend you want to test.
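Before starting either setup, it can help to confirm that the prerequisites listed above are present. The following is a minimal sketch, not part of the repository: the helper names `missing_tools` and `missing_env` are hypothetical, and `OPENAI_API_KEY` is only a placeholder, since the variable names the repository actually reads are not stated here.

```python
import os
import shutil
from typing import Callable, Iterable, List, Mapping, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def missing_env(
    names: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the environment variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Docker is needed for the AgentBench-style tasks, uv for the tau-bench-style tasks.
    print("Missing tools:", missing_tools(["docker", "uv"]))
    # Placeholder key name (assumption); substitute whatever your model backend expects.
    print("Missing API keys:", missing_env(["OPENAI_API_KEY"]))
```

An empty list from both checks means the basic prerequisites are in place; the per-subfolder setup instructions in the repository remain the authoritative source.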

No license information is mentioned in this repository.

In plain English

Life-Harness is a research system that improves how AI agents perform on tasks without changing the AI model itself. The core idea is that when a frozen AI model repeatedly fails at a task, you do not need to retrain it. Instead, you can adapt the layer of code that sits between the model and the task environment, which the researchers call the runtime interface.

The system observes where an agent fails, then adds lightweight runtime adjustments in four areas: how model decisions are translated into actions the environment can execute, how the task's rules and constraints are made explicit to the model, how multi-step interaction sequences are regulated to prevent the model from repeating the same failure, and how successful recovery patterns from past runs are stored and reused. None of these changes touch the model's internal weights, and the benchmark environments used for testing remain unmodified. The method requires no training.

The results reported in the paper cover seven task benchmarks, ranging from household navigation and web shopping to database querying and operating system interaction. Across eighteen AI model backbones, Life-Harness improved performance in 116 of the 126 model-environment combinations (18 backbones × 7 benchmarks), with an average relative gain of 88.5%.

The repository is structured in two parts matching two families of benchmark tasks: AgentBench-style tasks (which use Docker containers) and tau-bench-style tasks (which use a Python environment manager called uv). Each subfolder contains its own setup instructions. Users need to supply their own API keys for whatever AI model they want to test. The code accompanies a paper published on arXiv in 2026.
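The four adaptation areas described above can be illustrated with a toy wrapper. This is a minimal sketch under stated assumptions, not the repository's code: the class `RuntimeInterface`, its method names, the `"look"` fallback action, and the stub model are all hypothetical, chosen only to show where each kind of adjustment would sit between a frozen model and an unmodified environment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RuntimeInterface:
    """Hypothetical layer between a frozen model and an unmodified environment."""

    model: Callable[[str], str]   # frozen model: prompt text -> raw decision text
    rules: List[str]              # task constraints to state explicitly
    max_repeats: int = 2          # identical consecutive actions tolerated
    recoveries: Dict[str, str] = field(default_factory=dict)  # error -> fix that worked
    history: List[str] = field(default_factory=list)

    def build_prompt(self, observation: str, last_error: Optional[str] = None) -> str:
        # Area 2: make the task's rules and constraints explicit to the model.
        lines = ["Rules:"] + [f"- {rule}" for rule in self.rules]
        # Area 4: reuse a recovery pattern stored from a past run, if one matches.
        if last_error is not None and last_error in self.recoveries:
            lines.append(f"Hint from a past run: {self.recoveries[last_error]}")
        lines.append(f"Observation: {observation}")
        return "\n".join(lines)

    def translate(self, decision: str) -> str:
        # Area 1: normalise free-form model output into an executable action string.
        return decision.strip().lower().rstrip(".")

    def is_looping(self, action: str) -> bool:
        # Area 3: detect the same action being repeated max_repeats times in a row.
        recent = self.history[-self.max_repeats:]
        return len(recent) == self.max_repeats and all(a == action for a in recent)

    def step(self, observation: str, last_error: Optional[str] = None) -> str:
        action = self.translate(self.model(self.build_prompt(observation, last_error)))
        if self.is_looping(action):
            action = "look"  # hypothetical safe fallback that breaks the loop
        self.history.append(action)
        return action

    def remember_recovery(self, error: str, fix: str) -> None:
        # Area 4: store a recovery that worked so later runs can reuse it.
        self.recoveries[error] = fix


# Stub standing in for a frozen model that always gives the same untidy answer.
def stub_model(prompt: str) -> str:
    return "  Open Drawer. "


iface = RuntimeInterface(model=stub_model, rules=["Hold at most one object at a time"])
actions = [iface.step("You are in a kitchen") for _ in range(3)]
print(actions)  # ['open drawer', 'open drawer', 'look']
```

The model function is never modified here; only the prompt construction, action parsing, loop check, and recovery store change, which mirrors the claim that the weights and the environment stay untouched.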

Copy-paste prompts

Prompt 1

Explain how Life-Harness adapts the runtime interface between an AI model and a task environment in four areas, and why this beats retraining the model when it repeatedly fails.

Prompt 2

I want to test Life-Harness on a web shopping benchmark with GPT-4. Walk me through setting up the AgentBench-style Docker environment and configuring my API key.

Prompt 3

Using the Life-Harness approach, design a runtime adjustment for an AI agent that keeps failing at multi-step database querying, specifically the part that regulates interaction sequences to prevent repeated failure.

Prompt 4

How does Life-Harness store and reuse successful recovery patterns from past agent runs? Explain the mechanism and how I would apply it to a household navigation task.

Prompt 5

I want to benchmark Life-Harness across three different AI model backbones on tau-bench tasks. What setup steps do I need and what metrics should I track to compare against the paper's results?

Verify against the repo before relying on details.