Wire SoMatic into Claude Code or Cursor via MCP to let an AI agent click through any desktop GUI automatically.
Automate desktop workflows on Linux by having an agent detect screen elements and issue numbered click or type commands.
Run headless GUI automation inside an Xvfb virtual desktop so automated tasks do not disturb your real screen.
Requires Python 3.10+ and npm; downloads AGPL-licensed computer-vision weights on first run.
SoMatic is a command-line tool that lets AI agents control a desktop computer by clicking, typing, scrolling, and pressing keys. The core idea is to give agents a reliable way to locate things on screen. Rather than having the agent guess pixel coordinates, SoMatic runs a computer-vision model that scans each screenshot and draws a numbered label on every interactive element it finds. The agent then says "click element 3" or "type text at element 12" and SoMatic handles the actual input. Every command returns structured JSON output, which makes it straightforward for agents to parse results and decide what to do next.

SoMatic supports all the standard desktop actions: single and double clicks, right-clicks, drags, scrolls, key presses, and text entry. It can also take screenshots with the numbered annotations baked in, so an agent always has a current view of what is on screen.

The tool installs via npm and uses Python for its core. It includes an MCP server, which is a standard connection format that allows tools like Claude Code and Cursor to wire SoMatic in as a built-in capability without any extra prompting. On Linux, it also supports running inside a virtual desktop (Xvfb), so automated tasks can run without disturbing your real screen.

Benchmarks included in the repository show that combining SoMatic's element-detection output with a capable language model reaches around 68 to 78 percent accuracy on two standard GUI-automation test sets, compared to 52 to 60 percent when the model works from screenshots alone without any detection hints.

The core code is licensed under MIT. The computer-vision weights it downloads at first run are licensed under AGPL-3.0, which the project keeps separate from its own code to avoid AGPL obligations on the published package. Python 3.10 or newer is required.
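To make the "click element 3" idea concrete, here is a minimal Python sketch of how an agent-side script might turn a numbered element into a click point. The JSON shape (`elements`, `id`, `label`, `box`) and the helper names are assumptions made up for illustration; they are not SoMatic's documented output format, so check the repository for the real schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One detected on-screen element (hypothetical shape, not SoMatic's schema)."""
    id: int
    label: str
    box: tuple[int, int, int, int]  # x, y, width, height in pixels

    def center(self) -> tuple[int, int]:
        # Click in the middle of the bounding box.
        x, y, w, h = self.box
        return (x + w // 2, y + h // 2)


def parse_elements(raw: str) -> dict[int, Element]:
    """Index the detected elements by their numbered label."""
    payload = json.loads(raw)
    return {
        e["id"]: Element(e["id"], e["label"], tuple(e["box"]))
        for e in payload["elements"]
    }


def resolve_click(elements: dict[int, Element], element_id: int) -> tuple[int, int]:
    """Map 'click element N' to a pixel coordinate, failing loudly on stale ids."""
    if element_id not in elements:
        raise KeyError(f"element {element_id} is not on screen; take a new screenshot")
    return elements[element_id].center()


# Assumed sample output, invented for this sketch.
SAMPLE_OUTPUT = """
{
  "elements": [
    {"id": 3,  "label": "Save button",  "box": [100, 200, 100, 40]},
    {"id": 12, "label": "Search field", "box": [40, 300, 320, 30]}
  ]
}
"""

if __name__ == "__main__":
    elements = parse_elements(SAMPLE_OUTPUT)
    print(resolve_click(elements, 3))
    print(resolve_click(elements, 12))
```

The point of the design is that the agent only ever names an element number. Coordinates are derived from the detection output, so the agent does not have to guess them.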
Verify against the repo before relying on details.