explaingit

smyan1909/somatic

17 · Python · Audience: developer · Complexity: 3/5 · License · Setup: moderate

TLDR

A tool that lets AI agents control a desktop by detecting and numbering every clickable element on screen, so the agent can say "click element 3" instead of guessing pixel coordinates.

Mindmap

mindmap
  root((SoMatic))
    What it does
      Element detection
      Numbered annotations
      Desktop control
    Actions
      Click and type
      Scroll and drag
      Screenshots
    Integrations
      Claude Code MCP
      Cursor MCP
      Xvfb headless
    Performance
      Higher accuracy
      Vs baseline screenshots

Things people build with this

USE CASE 1

Wire SoMatic into Claude Code or Cursor via MCP to let an AI agent click through any desktop GUI automatically.

USE CASE 2

Automate desktop workflows on Linux by having an agent detect screen elements and issue numbered click or type commands.

USE CASE 3

Run headless GUI automation inside an Xvfb virtual desktop so automated tasks do not disturb your real screen.
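The headless setup in use case 3 can be sketched as follows. This is a minimal illustration, not SoMatic's documented procedure: it assumes Xvfb is installed, that the automation tool is an ordinary X11 client that picks its display from the `DISPLAY` environment variable, and that `child_command` is whatever command you would normally run. The function names `xvfb_command`, `env_for_display`, and `run_headless` are hypothetical helpers invented for this example.

```python
import os
import shutil
import subprocess
import time
from typing import Optional


def xvfb_command(display: str = ":99", width: int = 1280,
                 height: int = 800, depth: int = 24) -> list:
    """Build the Xvfb invocation for one virtual screen (screen number 0)."""
    return ["Xvfb", display, "-screen", "0", f"{width}x{height}x{depth}"]


def env_for_display(display: str, base_env: Optional[dict] = None) -> dict:
    """Copy the environment and point X11 clients at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def run_headless(child_command: list, display: str = ":99") -> int:
    """Start Xvfb, run child_command against it, then stop Xvfb."""
    if shutil.which("Xvfb") is None:
        raise RuntimeError("Xvfb is not installed or not on PATH")
    server = subprocess.Popen(xvfb_command(display))
    try:
        # Crude startup wait (assumption); a robust version would poll
        # until the display actually accepts connections.
        time.sleep(0.5)
        result = subprocess.run(child_command, env=env_for_display(display))
        return result.returncode
    finally:
        server.terminate()
        server.wait()
```

Because the real display is never touched, anything the child process clicks or types lands only on the virtual screen. Check the repository for the project's own Xvfb instructions before relying on this pattern.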

Tech stack

Python · npm · MCP · Xvfb

Getting it running

Difficulty · moderate · Time to first run · 30 min

Requires Python 3.10+ and npm. Downloads AGPL-3.0-licensed computer-vision weights on first run.

The core code is MIT-licensed and free to use for any purpose. The computer-vision model weights it downloads are AGPL-3.0, which requires sharing source if you distribute software that includes them.

In plain English

SoMatic is a command-line tool that lets AI agents control a desktop computer by clicking, typing, scrolling, and pressing keys. Its core idea is to give agents a reliable way to locate things on screen. Rather than having the agent guess pixel coordinates, SoMatic runs a computer-vision model over each screenshot and draws a numbered label on every interactive element it finds. The agent then says "click element 3" or "type text at element 12", and SoMatic performs the actual input. Every command returns structured JSON, which makes it straightforward for agents to parse results and decide what to do next.

SoMatic supports the standard desktop actions: single and double clicks, right-clicks, drags, scrolls, key presses, and text entry. It can also take screenshots with the numbered annotations baked in, so an agent always has a current view of what is on screen.

The tool installs via npm and uses Python for its core. It includes an MCP (Model Context Protocol) server, a standard way to connect tools to AI assistants, so clients like Claude Code and Cursor can use SoMatic as a built-in capability without any extra prompting. On Linux, it also supports running inside a virtual desktop (Xvfb), so automated tasks can run without disturbing your real screen.

Benchmarks included in the repository show that combining SoMatic's element-detection output with a capable language model reaches around 68 to 78 percent accuracy on two standard GUI-automation test sets, compared with 52 to 60 percent when the model works from screenshots alone, without any detection hints.

The core code is licensed under MIT. The computer-vision weights it downloads at first run are licensed under AGPL-3.0, and the project keeps them separate from its own code to avoid AGPL obligations on the published package. Python 3.10 or newer is required.
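The numbered-element idea described above can be illustrated with a small, self-contained sketch. It is not SoMatic's code or API: the `Box` type, the reading-order numbering, the `click_element` helper, and the JSON field names are all assumptions made for illustration, and no real mouse input is sent. It shows only the general pattern of turning detected bounding boxes into numbered targets and resolving "click element 3" to pixel coordinates.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical detector output)."""
    x: int
    y: int
    width: int
    height: int

    def center(self) -> tuple:
        """Return the pixel at the middle of the box."""
        return (self.x + self.width // 2, self.y + self.height // 2)


def number_elements(boxes: list) -> dict:
    """Assign 1-based IDs in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    return {i: box for i, box in enumerate(ordered, start=1)}


def click_element(elements: dict, element_id: int) -> str:
    """Resolve an element ID to coordinates and report the outcome as JSON.

    A real tool would move the pointer and click here; this sketch only
    returns the coordinates it would have used.
    """
    box = elements.get(element_id)
    if box is None:
        return json.dumps({"ok": False, "error": f"unknown element {element_id}"})
    cx, cy = box.center()
    return json.dumps(
        {"ok": True, "action": "click", "element": element_id, "x": cx, "y": cy}
    )


if __name__ == "__main__":
    # Three made-up detections standing in for a vision model's output.
    detected = [Box(300, 40, 80, 20), Box(10, 40, 100, 20), Box(10, 200, 60, 30)]
    elements = number_elements(detected)
    print(click_element(elements, 3))   # resolves to the third box in reading order
    print(click_element(elements, 9))   # no such element, so an error object
```

The agent only ever handles small integers and JSON objects, which is why this style of interface tends to be easier for a language model to use than raw coordinates. The actual command names and output schema should be taken from the repository.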

Copy-paste prompts

Prompt 1
I have installed SoMatic via npm and want to connect it to Claude Code as an MCP server. Walk me through the MCP config entry I need to add so Claude Code can use SoMatic to click elements on my desktop.
Prompt 2
I am building an AI agent that uses SoMatic to fill a form in a desktop app. Show me the JSON commands the agent should send to: take an annotated screenshot, click element 5, and type hello world at the focused input.
Prompt 3
I want to run SoMatic in a headless Xvfb virtual desktop on my Linux server so it does not interfere with my real screen. Walk me through setting up Xvfb and pointing SoMatic at the virtual display.

Verify against the repo before relying on details.