clawgym/clawgym-bench

Analysis updated 2026-06-24

★ 11 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((ClawGym-Bench))
    Inputs
      User instructions
      Mock workspaces
      Benchmark JSONL
    Outputs
      Task scores
      Code check results
      Rubric judgments
    Use Cases
      Evaluate agents
      Compare models
      Test workspace tools
    Tech Stack
      Python
      OpenClaw
      sglang
      vllm


What do people build with it?

USE CASE 1

Evaluate a Claw-style agent across 200 diverse workspace tasks

USE CASE 2

Compare hosted vs local model performance on agent benchmarks

USE CASE 3

Test new agent frameworks against code-verified task suites

USE CASE 4

Run rubric-plus-code hybrid scoring on agent outputs

What is it built with?

Python, OpenClaw, sglang, vllm, HuggingFace

How does it compare?

	clawgym/clawgym-bench	2arons/llm-cli	an1x3r/anima-artist-mixer
Stars	11	11	11
Language	Python	Python	Python
Setup difficulty	hard	easy	easy
Complexity	4/5	2/5	2/5
Audience	researcher	developer	designer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Requires installing OpenClaw and deploying a served model via sglang or vllm before any tasks run.
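Before launching a full run, it can help to confirm that the benchmark file is readable. The sketch below is a minimal sanity check, not part of the repository: it assumes the data sits at data/benchmark_data.jsonl with one JSON object per line, and the helper name summarize_jsonl is hypothetical. It makes no assumption about field names and only reports the record count and the keys it finds.

```python
import json
from pathlib import Path


def summarize_jsonl(lines):
    """Count JSON records and collect the top-level keys seen.

    `lines` is any iterable of strings (an open file works).
    Blank lines are skipped. This is a hypothetical helper,
    not a function provided by ClawGym-Bench.
    """
    count = 0
    keys = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)


# Assumed location of the benchmark data, relative to the repo root.
path = Path("data/benchmark_data.jsonl")
if path.exists():
    with path.open(encoding="utf-8") as f:
        n_records, field_names = summarize_jsonl(f)
    print(f"{n_records} records; fields: {field_names}")
```

If the file is intact, the record count should match the 200 instances the benchmark reports; the printed field names show the actual schema, which is worth checking before writing any further tooling.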

In plain English

ClawGym-Bench is a benchmark for evaluating agents that run in the Claw-style framework. It contains 200 test instances. Each instance gives the agent a user instruction, a mock workspace with files and resources, and a verifier that checks whether the task was completed correctly. The dataset is available on Hugging Face under the RUC-AIBOX organization.

Of the 200 tasks, 156 are scored entirely by code-based checks that run after the agent finishes. The remaining 44 use a hybrid score that combines code checks with a rubric-based judgment, with code checks weighted at 0.7 and the rubric at 0.3. The benchmark was put together through difficulty-aware filtering followed by a review pass that combines human reviewers and a language model.

The tasks fall into six categories covering different kinds of workspace work. Product and collaboration tasks make up 44 of the 200, systems and automation tasks 42, analysis and reasoning tasks 35, content and domain tasks 28, planning and knowledge tasks 26, and software development tasks 25. Each category is described as workspace-grounded, meaning the agent has to interact with files and tools in a mock workspace rather than answer in chat.

Running an evaluation requires OpenClaw, an agent framework the README installs with a single curl command. After OpenClaw is installed the user deploys a model, either a hosted one or a local model served by sglang or vllm. The benchmark data lives at data/benchmark_data.jsonl, and a shell script under evaluation/localclawbench/scripts launches the run. The code checker for each task is shipped inside an input_files folder at reward/test.py, but it is hidden from the agent during execution and only used afterwards. The README explains this is to avoid reward hacking, where the agent peeks at the test to game the score. The repository itself is mostly evaluation scripts and dataset references.
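The scoring rule from the summary can be written out in a few lines. This is a sketch under stated assumptions, not the repository's implementation: it assumes both the code-check score and the rubric score are normalized to the range 0 to 1, and the names task_score, CODE_WEIGHT, RUBRIC_WEIGHT, and CATEGORY_COUNTS are made up for illustration. Only the 0.7 and 0.3 weights and the category sizes come from the text.

```python
# Weights stated in the summary for the 44 hybrid-scored tasks.
CODE_WEIGHT = 0.7
RUBRIC_WEIGHT = 0.3

# Category sizes as listed in the summary.
CATEGORY_COUNTS = {
    "product and collaboration": 44,
    "systems and automation": 42,
    "analysis and reasoning": 35,
    "content and domain": 28,
    "planning and knowledge": 26,
    "software development": 25,
}


def task_score(code_score, rubric_score=None):
    """Score one task (hypothetical helper).

    Assumes both inputs lie in [0, 1]. Pass rubric_score=None for
    the tasks scored by code checks alone.
    """
    if rubric_score is None:
        return code_score
    return CODE_WEIGHT * code_score + RUBRIC_WEIGHT * rubric_score


# The six categories should account for every task.
assert sum(CATEGORY_COUNTS.values()) == 200

# A hybrid task that passes all code checks but earns half the rubric credit.
print(round(task_score(1.0, 0.5), 2))
```

Under these assumptions, a hybrid task can never score above 0.7 on code checks alone, so the rubric judgment decides the last 30 percent of the score.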

Copy-paste prompts

Prompt 1

Walk me through running ClawGym-Bench end-to-end on a local vllm-served model

Prompt 2

Show me how the hybrid 0.7 code plus 0.3 rubric scoring is computed in this repo

Prompt 3

Help me write a new task instance compatible with the ClawGym-Bench schema

Prompt 4

Explain how reward/test.py is hidden from the agent during execution

Prompt 5

Generate a script that aggregates ClawGym-Bench results by category

Frequently asked questions

What is clawgym-bench?

A 200-task benchmark for evaluating Claw-style agents on workspace tasks, using code checks and hybrid rubric scoring across six categories.

What language is clawgym-bench written in?

Mainly Python. The stack also includes OpenClaw, sglang, and vllm.

How hard is clawgym-bench to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is clawgym-bench for?

Mainly researchers.


Verify against the repo before relying on details.