Benchmark a frontier desktop agent against 1,000 tasks across 33 real applications.
Generate new evaluation tasks for an app by writing a verifier and letting the synthesis loop propose goals.
Repair drifting verifiers using the self-evolving layer when execution feedback shows a mismatch.
Run an evaluation in an E2B sandbox, local Docker, or a remote Docker fleet on AWS or Tencent Cloud.
Setup requires Docker or an E2B account, the prebuilt Ubuntu XFCE template, and API keys for the agent model, so getting a single end-to-end run working takes several hours.
OpenComputer is a research project for testing AI agents that operate a desktop computer the way a person would: opening apps, clicking buttons, filling in forms, editing documents. The problem the authors describe is that hand-built benchmarks for this kind of agent do not scale, because every task needs its own starter files, screen state, and a custom check to decide if the agent succeeded. OpenComputer automates the generation of both the tasks and the checks.
The system has four parts. App-specific verifiers expose programmatic check endpoints that read live state from a real application, using things like the browser's debugging protocol, D-Bus, the LibreOffice UNO interface, AT-SPI, files on disk, or SQLite profile databases. A self-evolving layer repairs those verifiers when execution feedback shows they are wrong. A task generator proposes goals, scores them, matches each one to a verifier, and produces the input files needed, such as CSVs, ODT or ODS documents, images, and project files. An evaluation runner records the full trajectory of an agent's actions and assigns partial credit.
The current release covers 33 desktop applications and 1,000 tasks across browsers, office software, creative tools, IDEs, file managers, and chat apps. The README states that programmatic verifiers agreed with human judges more often than an LLM-as-judge setup, especially when correctness depends on small details of application state. It also reports that frontier agents finish few tasks end to end and that open-source models score lower here than on the earlier OSWorld-Verified benchmark.
The code runs against three backends: E2B cloud sandboxes, local Docker, or a remote Docker fleet on AWS or Tencent Cloud. All three use the same Ubuntu XFCE image with the app suite preinstalled.
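To make the idea of a programmatic verifier with partial credit concrete, here is a minimal sketch in Python. It is not taken from the OpenComputer codebase. The `bookmarks` table, the `verify_bookmark` and `partial_credit` functions, and the rule of scoring by the fraction of checks passed are all assumptions made for illustration. An in-memory SQLite database stands in for an application's profile database.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool


def verify_bookmark(conn: sqlite3.Connection, url: str, title: str) -> list[CheckResult]:
    """Hypothetical verifier: read application state from a SQLite database.

    The `bookmarks` schema is an assumption for this sketch, not a real
    application's profile layout.
    """
    row = conn.execute(
        "SELECT title FROM bookmarks WHERE url = ?", (url,)
    ).fetchone()
    return [
        CheckResult("bookmark_exists", row is not None),
        CheckResult("title_matches", row is not None and row[0] == title),
    ]


def partial_credit(results: list[CheckResult]) -> float:
    """Fraction of checks passed (an assumed scoring rule, not the project's)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def demo_score() -> float:
    """Simulate an agent that saved the bookmark but gave it the wrong title."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookmarks (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute(
        "INSERT INTO bookmarks VALUES (?, ?)",
        ("https://example.org", "Wrong title"),
    )
    results = verify_bookmark(conn, "https://example.org", "Example")
    conn.close()
    return partial_credit(results)


if __name__ == "__main__":
    print(demo_score())
```

In this sketch the agent gets credit for creating the bookmark but not for the title, which is the kind of small state detail the summary says programmatic checks capture better than an LLM judge.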
To run an evaluation, the user clones the repo, fills in API keys for the agent model and the backend, builds the desktop template, and calls python evaluation/run_eval.py with arguments for app, task, model, and parallelism. Single tasks, all tasks for one app, and resumed runs are all supported. A root CLAUDE.md walks a coding agent through the full synthesis loop. The license is Apache 2.0 and there is an arXiv paper linked from the README.
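The invocation described above can be sketched as a small command builder. Only the script path evaluation/run_eval.py comes from the summary. The flag names `--app`, `--task`, `--model`, and `--parallel` are hypothetical placeholders, so check the script's own help output for the real argument names before using this.

```python
from __future__ import annotations

import shlex

# Hypothetical flag names: the summary only says the script takes arguments
# for app, task, model, and parallelism. Replace these with the real ones.
ASSUMED_FLAGS = {
    "app": "--app",
    "task": "--task",
    "model": "--model",
    "parallel": "--parallel",
}


def build_eval_command(
    app: str, model: str, task: str | None = None, parallel: int = 1
) -> list[str]:
    """Build an argument list for evaluation/run_eval.py without running it."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    cmd = [
        "python",
        "evaluation/run_eval.py",
        ASSUMED_FLAGS["app"], app,
        ASSUMED_FLAGS["model"], model,
    ]
    # Omitting the task is meant to correspond to "all tasks for one app".
    if task is not None:
        cmd += [ASSUMED_FLAGS["task"], task]
    cmd += [ASSUMED_FLAGS["parallel"], str(parallel)]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run
    # only after confirming the flag names against the repo.
    print(shlex.join(build_eval_command("example-app", "example-model", parallel=4)))
```

Building the command as a list rather than a single string avoids shell quoting problems if an app or task name contains spaces.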
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.