explaingit

clawgym/clawgym-bench

11 | Python | Audience · researcher | Complexity · 4/5 | Active | Setup · hard

TLDR

A 200-task benchmark for evaluating Claw-style agents on workspace tasks, using code checks and hybrid rubric scoring across six categories.

Mindmap

mindmap
  root((ClawGym-Bench))
    Inputs
      User instructions
      Mock workspaces
      Benchmark JSONL
    Outputs
      Task scores
      Code check results
      Rubric judgments
    Use Cases
      Evaluate agents
      Compare models
      Test workspace tools
    Tech Stack
      Python
      OpenClaw
      sglang
      vllm

Things people build with this

USE CASE 1

Evaluate a Claw-style agent across 200 diverse workspace tasks

USE CASE 2

Compare hosted vs local model performance on agent benchmarks

USE CASE 3

Test new agent frameworks against code-verified task suites

USE CASE 4

Run rubric-plus-code hybrid scoring on agent outputs

Tech stack

Python | OpenClaw | sglang | vllm | HuggingFace

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires installing OpenClaw and deploying a served model via sglang or vllm before any tasks run.

In plain English

ClawGym-Bench is a benchmark for evaluating agents that run in the Claw-style framework. It contains 200 test instances. Each instance gives the agent a user instruction, a mock workspace with files and resources, and a verifier that checks whether the task was completed correctly. The dataset is available on Hugging Face under the RUC-AIBOX organization.

Of the 200 tasks, 156 are scored entirely by code-based checks that run after the agent finishes. The remaining 44 use a hybrid score that combines code checks with a rubric-based judgment, with code checks weighted at 0.7 and the rubric at 0.3. The benchmark was put together through difficulty-aware filtering followed by a review pass that combines human reviewers and a language model.

The tasks fall into six categories covering different kinds of workspace work. Product and collaboration tasks make up 44 of the 200, systems and automation tasks 42, analysis and reasoning tasks 35, content and domain tasks 28, planning and knowledge tasks 26, and software development tasks 25. Each category is described as workspace-grounded, meaning the agent has to interact with files and tools in a mock workspace rather than answer in chat.

Running an evaluation requires OpenClaw, an agent framework the README installs with a single curl command. After OpenClaw is installed the user deploys a model, either a hosted one or a local model served by sglang or vllm. The benchmark data lives at data/benchmark_data.jsonl, and a shell script under evaluation/localclawbench/scripts launches the run.

The code checker for each task is shipped inside an input_files folder at reward/test.py, but it is hidden from the agent during execution and only used afterwards. The README explains this is to avoid reward hacking, where the agent peeks at the test to game the score. The repository itself is mostly evaluation scripts and dataset references.
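The scoring split described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: it assumes the hybrid score is a plain weighted sum and that both component scores are normalized to the range 0 to 1. The names `task_score`, `CODE_WEIGHT`, `RUBRIC_WEIGHT`, and `CATEGORY_COUNTS` are hypothetical; only the 0.7 and 0.3 weights and the per-category task counts come from the description.

```python
from typing import Optional

# Weights stated in the description; the weighted-sum form is an assumption.
CODE_WEIGHT = 0.7
RUBRIC_WEIGHT = 0.3

# Per-category task counts as listed in the description.
CATEGORY_COUNTS = {
    "product and collaboration": 44,
    "systems and automation": 42,
    "analysis and reasoning": 35,
    "content and domain": 28,
    "planning and knowledge": 26,
    "software development": 25,
}


def task_score(code_score: float, rubric_score: Optional[float] = None) -> float:
    """Return a task score in [0, 1] (hypothetical helper).

    Code-only tasks pass just `code_score`; hybrid tasks also pass
    `rubric_score`. Both inputs are assumed to be normalized to [0, 1].
    """
    for name, value in (("code_score", code_score), ("rubric_score", rubric_score)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if rubric_score is None:
        # Code-only task: the code check result is the whole score.
        return code_score
    # Hybrid task: assumed weighted sum of code checks and rubric judgment.
    return CODE_WEIGHT * code_score + RUBRIC_WEIGHT * rubric_score


if __name__ == "__main__":
    print(task_score(1.0))        # code-only task that passed every check
    print(task_score(1.0, 0.5))   # hybrid task with a middling rubric judgment
    print(sum(CATEGORY_COUNTS.values()))  # should match the 200-task total
```

Under these assumptions, a hybrid task that passes every code check can still lose up to 0.3 of its score on the rubric, so code checks dominate the result but cannot reach the maximum alone.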

Copy-paste prompts

Prompt 1
Walk me through running ClawGym-Bench end-to-end on a local vllm-served model
Prompt 2
Show me how the hybrid 0.7 code plus 0.3 rubric scoring is computed in this repo
Prompt 3
Help me write a new task instance compatible with the ClawGym-Bench schema
Prompt 4
Explain how reward/test.py is hidden from the agent during execution
Prompt 5
Generate a script that aggregates ClawGym-Bench results by category

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.