explaingit

echo0715/opencomputer

16 · Python · Audience · researcher · Complexity · 5/5 · Active · License · Setup · hard

TLDR

A research benchmark and synthesis loop for desktop AI agents, with 1,000 auto-generated tasks across 33 apps and programmatic verifiers that read real app state.

Mindmap

mindmap
  root((OpenComputer))
    Inputs
      Agent model keys
      Backend choice
      Task and app args
    Outputs
      Action trajectories
      Partial-credit scores
      Generated task files
    Use Cases
      Benchmark desktop agents
      Generate new tasks
      Repair broken verifiers
    Tech Stack
      Python
      Docker
      E2B
      LibreOffice UNO
      AT-SPI

Things people build with this

USE CASE 1

Benchmark a frontier desktop agent against 1,000 tasks across 33 real applications.

USE CASE 2

Generate new evaluation tasks for an app by writing a verifier and letting the synthesis loop propose goals.

USE CASE 3

Repair drifting verifiers using the self-evolving layer when execution feedback shows mismatch.

USE CASE 4

Run an evaluation in an E2B sandbox, local Docker, or a remote Docker fleet on AWS or Tencent Cloud.

Tech stack

Python · Docker · E2B · Ubuntu · LibreOffice

Getting it running

Difficulty · hard · Time to first run · 1 day+

Needs Docker or an E2B account, the prebuilt Ubuntu XFCE template, and API keys for the agent model, so even a single end-to-end run takes several hours of setup.

Apache 2.0 license: you can use, modify, and ship it commercially as long as you keep the notice and state any changes you made.

In plain English

OpenComputer is a research project for testing AI agents that operate a desktop computer the way a person would: opening apps, clicking buttons, filling in forms, editing documents. The problem the authors describe is that hand-built benchmarks for this kind of agent do not scale, because every task needs its own starter files, screen state, and a custom check to decide if the agent succeeded. OpenComputer automates the generation of both the tasks and the checks.

The system has four parts. App-specific verifiers expose programmatic check endpoints that read live state from a real application, using things like the browser's debugging protocol, D-Bus, the LibreOffice UNO interface, AT-SPI, files on disk, or SQLite profile databases. A self-evolving layer repairs those verifiers when execution feedback shows they are wrong. A task generator proposes goals, scores them, matches each one to a verifier, and produces the input files needed, such as CSVs, ODT or ODS documents, images, and project files. An evaluation runner records the full trajectory of an agent's actions and assigns partial credit.

The current release covers 33 desktop applications and 1,000 tasks across browsers, office software, creative tools, IDEs, file managers, and chat apps. The README states that programmatic verifiers agreed with human judges more often than an LLM-as-judge setup, especially when correctness depends on small details of application state. It also reports that frontier agents finish few tasks end to end and that open-source models score lower here than on the earlier OSWorld-Verified benchmark.

The code runs against three backends: E2B cloud sandboxes, local Docker, or a remote Docker fleet on AWS or Tencent Cloud. All three use the same Ubuntu XFCE image with the app suite preinstalled.
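To make the idea of a programmatic verifier with partial credit concrete, here is a minimal sketch. It is not taken from the repository: the `Check` class, the `score` function, and the `bookmarks` table schema are all hypothetical, and an in-memory SQLite database stands in for an application's profile database.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One weighted condition on application state (hypothetical structure)."""
    name: str
    weight: float
    passed: Callable[[sqlite3.Connection], bool]


def score(conn: sqlite3.Connection, checks: List[Check]) -> float:
    """Return the weighted fraction of checks that pass, between 0.0 and 1.0."""
    total = sum(c.weight for c in checks)
    earned = sum(c.weight for c in checks if c.passed(conn))
    return earned / total if total else 0.0


def has_bookmark(url: str) -> Callable[[sqlite3.Connection], bool]:
    """Build a check that passes when a bookmark with this URL exists."""
    def _check(conn: sqlite3.Connection) -> bool:
        row = conn.execute(
            "SELECT 1 FROM bookmarks WHERE url = ?", (url,)
        ).fetchone()
        return row is not None
    return _check


def bookmark_in_folder(url: str, folder: str) -> Callable[[sqlite3.Connection], bool]:
    """Build a check that passes when the bookmark sits in the given folder."""
    def _check(conn: sqlite3.Connection) -> bool:
        row = conn.execute(
            "SELECT 1 FROM bookmarks WHERE url = ? AND folder = ?", (url, folder)
        ).fetchone()
        return row is not None
    return _check


# Stand-in for an app's profile database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT, folder TEXT)")
conn.execute(
    "INSERT INTO bookmarks VALUES (?, ?)", ("https://example.org", "Unsorted")
)

checks = [
    Check("bookmark exists", 1.0, has_bookmark("https://example.org")),
    Check("bookmark in Research folder", 1.0,
          bookmark_in_folder("https://example.org", "Research")),
]

# The agent saved the bookmark but left it in the wrong folder.
result = score(conn, checks)
print(f"partial credit: {result:.2f}")
```

Here the agent earns half credit: the state shows the bookmark was created, but not filed where the task asked. Reading the stored state directly is what lets a check distinguish these two outcomes.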
To run an evaluation, the user clones the repo, fills in API keys for the agent model and the backend, builds the desktop template, and calls python evaluation/run_eval.py with arguments for app, task, model, and parallelism. Single tasks, all tasks for one app, and resumed runs are all supported. A root CLAUDE.md walks a coding agent through the full synthesis loop. The license is Apache 2.0 and there is an arXiv paper linked from the README.
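The resume and parallelism behaviour mentioned above could work roughly as follows. This is a generic sketch under stated assumptions, not the logic in `evaluation/run_eval.py`: the JSONL results file, its `task_id` field, and the function names are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List, Set


def completed_ids(results_path: Path) -> Set[str]:
    """Collect task IDs already recorded in a JSONL results file (assumed format)."""
    if not results_path.exists():
        return set()
    done = set()
    with results_path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["task_id"])
    return done


def pending(all_tasks: Iterable[str], done: Set[str]) -> List[str]:
    """Keep only tasks without a recorded result, preserving their order."""
    return [t for t in all_tasks if t not in done]


def run_all(tasks: List, run_one: Callable, parallelism: int) -> List:
    """Run tasks on a bounded thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_one, tasks))
```

A resumed run would then call `pending(all_tasks, completed_ids(path))` before handing the remaining tasks to `run_all`. Check the repository's README for the actual flag names and result format.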

Copy-paste prompts

Prompt 1
Set up echo0715/OpenComputer with local Docker on Ubuntu 22.04. List every step from cloning to running python evaluation/run_eval.py for a single browser task.
Prompt 2
Add a new desktop app verifier to OpenComputer for Inkscape. Show the file layout, a sample check endpoint, and how to plug it into the task generator.
Prompt 3
Run OpenComputer against Claude Sonnet 4.6 on all LibreOffice Calc tasks. Provide the run_eval.py command and parse the output into a Markdown report.
Prompt 4
Compare OpenComputer with OSWorld-Verified. Summarize the methodology differences and why programmatic verifiers beat LLM-as-judge here.
Prompt 5
Debug a failing AT-SPI verifier in OpenComputer that always returns false. List the likely causes and the trajectory replay commands I should run.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.