ToolCUA is a research project for training AI agents that control a desktop computer using both graphical interface actions (clicking, typing, scrolling) and higher-level tool calls (API-based file or application operations). The challenge it addresses is that giving an agent access to both types of actions does not by itself make it reliable: the agent has to learn when to use the mouse and keyboard, when to call a tool directly, and when to switch back, without confusing the two. The project introduces a three-stage training pipeline to teach this decision-making. First, it generates training data by synthesizing tool calls into existing datasets of GUI-only actions. Second, it uses a technique called Tool-Bootstrapped GUI RFT (a form of reinforcement fine-tuning) to teach the agent when to invoke tools and when to keep using the interface. Third, it applies Online Agentic Reinforcement Learning with a reward signal that encourages both task success and efficient paths, that is, completing tasks in fewer steps. The resulting model, ToolCUA-8B (based on Qwen3-VL-8B), is evaluated on OSWorld-MCP, a benchmark of desktop tasks, where it improves accuracy by 18 percentage points over its baseline while completing tasks in fewer steps on average. The repository includes the trained 8B model on Hugging Face, evaluation scripts, and benchmark data. It is a Python project and requires vLLM to serve the model. This summary covers only part of the README; the full README is longer.
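The reward described for the third stage, which favors both task success and shorter paths, can be illustrated with a small sketch. This is not the project's actual reward function: the `Episode` structure, the `shaped_reward` name, the step budget of 30, and the 0.2 efficiency weight are all assumptions chosen only to show the general idea of adding a step-efficiency bonus to a success signal.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Outcome of one rollout (hypothetical structure, not taken from the repo)."""
    success: bool  # whether the agent completed the task
    steps: int     # actions taken, counting GUI actions and tool calls alike


def shaped_reward(ep: Episode, max_steps: int = 30, efficiency_weight: float = 0.2) -> float:
    """Return 0.0 for a failed episode; for a success, 1.0 plus a bonus
    proportional to the fraction of the step budget left unused.

    max_steps and efficiency_weight are illustrative assumptions.
    """
    if not ep.success:
        return 0.0
    # Clamp the step count into [1, max_steps] so the bonus stays in [0, weight).
    steps = min(max(ep.steps, 1), max_steps)
    unused_fraction = (max_steps - steps) / max_steps
    return 1.0 + efficiency_weight * unused_fraction
```

Under these assumptions a failed episode earns nothing however short it is, so the efficiency bonus only ever separates successful rollouts from one another and cannot reward quitting early.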
Verify against the repo before relying on details.