explaingit

x-plug/toolcua

19 · Python
TLDR

ToolCUA is a research project for training AI agents that can control a desktop computer using both graphical interface actions (clicking, typing, scrolling) and higher-level tool calls (API-based file or application operations).

In plain English

ToolCUA is a research project for training AI agents that can control a desktop computer using both graphical interface actions (clicking, typing, scrolling) and higher-level tool calls (API-based file or application operations). The challenge it addresses is that simply giving an AI agent access to both types of actions does not make it reliable. The agent has to learn when to use the mouse and keyboard, when to call a tool directly, and when to switch back, without confusing the two.

The project introduces a three-stage training pipeline to teach this decision-making. First, it generates training data by synthesizing tool calls into existing datasets of GUI-only actions. Second, it uses a technique called Tool-Bootstrapped GUI RFT (a form of reinforcement fine-tuning) to teach the agent when to invoke tools and when to keep using the interface. Third, it applies Online Agentic Reinforcement Learning with a reward signal that encourages both task success and efficient paths, that is, completing tasks in fewer steps.

The resulting model, ToolCUA-8B (based on Qwen3-VL-8B), is evaluated on OSWorld-MCP, a benchmark of desktop tasks. It improves accuracy by 18 percentage points over its baseline while completing tasks in fewer steps on average. The repository includes evaluation scripts and benchmark data, and the trained 8B model is available on Hugging Face. It is a Python project and requires vLLM for serving the model. The full README is longer than what was provided.
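The third-stage reward is described above only as favoring task success and shorter paths. The exact formula is not given here, so the following is a toy sketch of that idea, not the project's implementation. The `Action` schema, the `efficiency_aware_reward` function, the step budget, the weight, and the tool name `save_file` are all hypothetical, chosen only to show how a success term and a step-efficiency bonus might be combined.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Action:
    """One step in a trajectory (hypothetical schema, not from the repo)."""
    kind: Literal["gui", "tool"]  # GUI action or direct tool call
    name: str                     # e.g. "click", or an illustrative tool name


def efficiency_aware_reward(
    success: bool,
    trajectory: List[Action],
    max_steps: int = 30,              # assumed step budget
    efficiency_weight: float = 0.5,   # assumed weight of the efficiency bonus
) -> float:
    """Return 0 on failure; on success, 1 plus a bonus for unused steps."""
    if not success:
        return 0.0
    steps = min(len(trajectory), max_steps)
    saved_fraction = 1.0 - steps / max_steps  # share of the budget left unused
    return 1.0 + efficiency_weight * saved_fraction


# A five-step GUI-only path versus a single tool call for the same task.
gui_path = [Action("gui", "click")] * 4 + [Action("gui", "type")]
tool_path = [Action("tool", "save_file")]

gui_reward = efficiency_aware_reward(True, gui_path, max_steps=10)
tool_reward = efficiency_aware_reward(True, tool_path, max_steps=10)
failed_reward = efficiency_aware_reward(False, tool_path, max_steps=10)

print(gui_reward, tool_reward, failed_reward)
```

Under these assumed settings, a successful run always scores higher than a failed one, and among successful runs the shorter path scores higher. That matches the stated goal of rewarding a tool call when it genuinely shortens the task.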

Verify against the repo before relying on details.