Evaluate a Claw-style agent across 200 diverse workspace tasks
Compare hosted vs local model performance on agent benchmarks
Test new agent frameworks against code-verified task suites
Run rubric-plus-code hybrid scoring on agent outputs
Requires installing OpenClaw and setting up a model, either hosted or served locally via sglang or vllm, before any tasks run.
ClawGym-Bench is a benchmark for evaluating agents that run in the Claw-style framework. It contains 200 test instances. Each instance gives the agent a user instruction, a mock workspace with files and resources, and a verifier that checks whether the task was completed correctly. The dataset is available on Hugging Face under the RUC-AIBOX organization.

Of the 200 tasks, 156 are scored entirely by code-based checks that run after the agent finishes. The remaining 44 use a hybrid score that combines code checks with a rubric-based judgment, with code checks weighted at 0.7 and the rubric at 0.3. The benchmark was built through difficulty-aware filtering followed by a review pass that combines human reviewers and a language model.

The tasks fall into six categories covering different kinds of workspace work. Product and collaboration tasks make up 44 of the 200, systems and automation tasks 42, analysis and reasoning tasks 35, content and domain tasks 28, planning and knowledge tasks 26, and software development tasks 25. Each category is described as workspace-grounded, meaning the agent has to interact with files and tools in a mock workspace rather than answer in chat.

Running an evaluation requires OpenClaw, an agent framework the README installs with a single curl command. After OpenClaw is installed, the user deploys a model, either a hosted one or a local model served by sglang or vllm. The benchmark data lives at data/benchmark_data.jsonl, and a shell script under evaluation/localclawbench/scripts launches the run.

The code checker for each task is shipped inside an input_files folder at reward/test.py, but it is hidden from the agent during execution and only used afterwards. The README explains that this is to avoid reward hacking, where the agent peeks at the test to game the score. The repository itself is mostly evaluation scripts and dataset references.
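The split between code-only and hybrid scoring described above can be illustrated with a minimal sketch. Only the 0.7 and 0.3 weights come from the description. The sketch assumes that both the code-check result and the rubric judgment are normalized to the range 0 to 1, and that per-task scores are averaged into an overall figure; neither is stated in the source. The names TaskResult, task_score, and mean_score are hypothetical and are not taken from the ClawGym-Bench repository.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

CODE_WEIGHT = 0.7    # weight of the code checks, as stated in the description
RUBRIC_WEIGHT = 0.3  # weight of the rubric judgment, as stated in the description


@dataclass
class TaskResult:
    """Hypothetical record of one scored task (field names are illustrative)."""
    task_id: str
    code_score: float                     # assumed to lie in [0, 1]
    rubric_score: Optional[float] = None  # None marks a code-only task


def task_score(result: TaskResult) -> float:
    """Return the final score for one task under the assumptions above."""
    for value in (result.code_score, result.rubric_score):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range for {result.task_id}: {value}")
    if result.rubric_score is None:
        # Code-only task: the code checks are the whole score.
        return result.code_score
    # Hybrid task: weighted sum of code checks and rubric judgment.
    return CODE_WEIGHT * result.code_score + RUBRIC_WEIGHT * result.rubric_score


def mean_score(results: Iterable[TaskResult]) -> float:
    """Average the per-task scores (the averaging step is an assumption)."""
    scores = [task_score(r) for r in results]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    demo = [
        TaskResult("code-only-example", code_score=1.0),
        TaskResult("hybrid-example", code_score=0.5, rubric_score=1.0),
    ]
    for r in demo:
        print(r.task_id, round(task_score(r), 3))
    print("mean", round(mean_score(demo), 3))
```

Representing a code-only task as a missing rubric score keeps both task types in one record, so a single function covers the 156 code-only and 44 hybrid tasks without a separate flag.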
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.