explaingit

othersideai/self-operating-computer

10,248 · Python · Audience · developer · Complexity · 3/5 · Setup · moderate

TLDR

Self-Operating Computer is a Python framework that lets AI vision models control your computer by looking at the screen and issuing mouse and keyboard actions to complete goals you describe in plain English.

Mindmap

mindmap
  root((repo))
    What it does
      Screen reading
      Mouse and keyboard control
      Goal completion
      Multi-model support
    Tech Stack
      Python
      GPT-4o Claude Gemini
      Ollama local models
      OCR mode
    Use Cases
      Task automation
      UI testing
      Agent prototypes
    Audience
      Developers
      AI experimenters

Things people build with this

USE CASE 1

Automate repetitive desktop tasks by describing them in plain English and letting an AI agent execute the steps

USE CASE 2

Test a web or desktop app by giving an AI agent a user flow to complete and watching it navigate the real UI

USE CASE 3

Prototype a computer-use assistant that can open apps, search the web, and fill forms without writing automation scripts

USE CASE 4

Run computer-control tasks using a local open-source model via Ollama instead of cloud API keys

Tech stack

Python · GPT-4o · Claude 3 · Gemini · LLaVa · Ollama

Getting it running

Difficulty · moderate · Time to first run · 30 min

Requires an API key for the chosen vision model. Mac users must also grant the Terminal app screen recording and accessibility permissions in System Preferences.
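The two requirements above can be checked before a first run. The sketch below is an illustration only: `missing_prerequisites` is a hypothetical helper, not part of the project, and the variable name `OPENAI_API_KEY` is an assumed example that should be checked against the repo for the model you pick. The macOS line is a reminder rather than a real check, because this sketch does not try to query the permission state.

```python
import os
import platform


def missing_prerequisites(key_name, env=None, system=None):
    """Return a list of setup reminders (hypothetical helper, not project code).

    key_name: environment variable expected to hold the API key (assumed name).
    env:      mapping to inspect; defaults to the real process environment.
    system:   OS name as reported by platform.system(); "Darwin" means macOS.
    """
    env = os.environ if env is None else env
    system = platform.system() if system is None else system

    reminders = []
    if not env.get(key_name):
        reminders.append(f"Set {key_name} to the API key for your chosen model.")
    if system == "Darwin":
        # Permission state is not inspected here; this is only a reminder.
        reminders.append(
            "Confirm Terminal has Screen Recording and Accessibility permissions."
        )
    return reminders


if __name__ == "__main__":
    for line in missing_prerequisites("OPENAI_API_KEY"):
        print(line)
```

An empty list means neither reminder applies on the current machine.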

In plain English

Self-Operating Computer is a Python framework that lets AI models control a real computer the same way a human would: by looking at the screen and deciding what to click or type. You give it a goal in plain English, such as "open the browser and search for the weather in London", and the AI takes screenshots, figures out where things are on screen, and issues mouse and keyboard actions to complete the task.

The system connects to vision-capable AI models to do its work. By default it uses GPT-4o, but it also supports Google Gemini Pro Vision, Claude 3, Qwen-VL, and a locally-run open-source model called LLaVa via Ollama. Each model looks at a screenshot of your screen and decides what action to take next. Installation is a single pip command, and you start it by typing the word operate in your terminal.

Several modes change how the AI identifies where to click. The default OCR mode uses text recognition to build a map of clickable elements and their positions, which the README describes as the most accurate approach. A Set-of-Mark mode uses a small object-detection model to label buttons and interface elements directly on the screenshot. There is also a voice input option that lets you speak your objective rather than type it.

The framework was released in November 2023 and the README describes it as one of the first public examples of an AI system doing full computer control. It works on Mac, Windows, and Linux. On Mac, you need to grant the Terminal app screen recording and accessibility permissions in System Preferences before it can see your screen or move the mouse. The project requires an API key for whichever AI model you choose to use. It is open source and accepts contributions through the GitHub repository.
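The look, decide, act cycle and the OCR idea of mapping recognized text to a click position can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's code: `TextBox`, `Action`, `find_click_point`, and `run_goal` are hypothetical names, the action schema is invented for the example, and the screen capture, model call, mouse, and keyboard are passed in as plain functions so the sketch runs without a GUI or an API key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBox:
    """One OCR hit: recognized text plus its pixel bounding box."""
    text: str
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return (self.left + self.width // 2, self.top + self.height // 2)


@dataclass
class Action:
    """A model decision; kind is 'click', 'type', or 'done' (assumed schema)."""
    kind: str
    value: str = ""


def find_click_point(boxes: list[TextBox], label: str) -> Optional[tuple[int, int]]:
    """Return the center of the first box whose text contains the label."""
    wanted = label.lower()
    for box in boxes:
        if wanted in box.text.lower():
            return box.center()
    return None


def run_goal(goal, capture, decide, click, type_text, max_steps=10) -> int:
    """Capture the screen, ask the model, act; return the number of steps used."""
    for step in range(1, max_steps + 1):
        boxes = capture()                 # OCR results for the current screen
        action = decide(goal, boxes)      # the vision model's next decision
        if action.kind == "done":
            return step
        if action.kind == "click":
            point = find_click_point(boxes, action.value)
            if point is not None:
                click(*point)
        elif action.kind == "type":
            type_text(action.value)
    return max_steps


def demo() -> list[str]:
    """Drive the loop with a scripted stand-in for the model."""
    log: list[str] = []
    screen = [TextBox("Search", 100, 40, 80, 20)]
    script = iter([
        Action("click", "search"),
        Action("type", "weather in London"),
        Action("done"),
    ])
    steps = run_goal(
        "search for the weather in London",
        capture=lambda: screen,
        decide=lambda goal, boxes: next(script),
        click=lambda x, y: log.append(f"click {x},{y}"),
        type_text=lambda text: log.append(f"type {text}"),
    )
    log.append(f"steps {steps}")
    return log


if __name__ == "__main__":
    print(demo())
```

In a real setup the stand-ins would be replaced by a screenshot plus OCR step, a call to the chosen vision model, and a mouse and keyboard automation library. The step cap is a design choice in this sketch to stop a confused model from looping forever.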

Copy-paste prompts

Prompt 1
I installed Self-Operating Computer and want it to open my browser, go to a Wikipedia article, and copy the first paragraph into a text file. What goal should I type and how do I run it?
Prompt 2
How do I configure Self-Operating Computer to use Claude 3 as the vision model instead of the default GPT-4o?
Prompt 3
Write me a goal description for Self-Operating Computer that will make it open a CSV file in a spreadsheet app, find the largest value in column B, and paste it into a notes app.
Prompt 4
Self-Operating Computer says it cannot see my screen on Mac. Which permissions do I need to grant in System Preferences to allow screen recording and mouse control?
Prompt 5
Can I run Self-Operating Computer with a local model using Ollama instead of paying for an API? How do I set that up?

Verify against the repo before relying on details.