Automate repetitive desktop tasks by describing them in plain English and letting an AI agent execute the steps
Test a web or desktop app by giving an AI agent a user flow to complete and watching it navigate the real UI
Prototype a computer-use assistant that can open apps, search the web, and fill forms without writing automation scripts
Run computer-control tasks using a local open-source model via Ollama instead of cloud API keys
Requires an API key for the chosen vision model. Mac users must also grant screen recording and accessibility permissions in System Preferences.
Self-Operating Computer is a Python framework that lets AI models control a real computer the same way a human would: by looking at the screen and deciding what to click or type. You give it a goal in plain English, such as "open the browser and search for the weather in London", and the AI takes screenshots, figures out where things are on screen, and issues mouse and keyboard actions to complete the task.

The system connects to vision-capable AI models to do its work. By default it uses GPT-4o, but it also supports Google Gemini Pro Vision, Claude 3, Qwen-VL, and a locally-run open-source model called LLaVa via Ollama. Each model looks at a screenshot of your screen and decides what action to take next. Installation is a single pip command, and you start it by typing the word operate in your terminal.

Several modes change how the AI identifies where to click. The default OCR mode uses text recognition to build a map of clickable elements and their positions, which the README describes as the most accurate approach. A Set-of-Mark mode uses a small object-detection model to label buttons and interface elements directly on the screenshot. There is also a voice input option that lets you speak your objective rather than type it.

The framework was released in November 2023 and the README describes it as one of the first public examples of an AI system doing full computer control. It works on Mac, Windows, and Linux. On Mac, you need to grant the Terminal app screen recording and accessibility permissions in System Preferences before it can see your screen or move the mouse.

The project requires an API key for whichever AI model you choose to use. It is open source and accepts contributions through the GitHub repository.
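
The screenshot, decide, act loop and the OCR-style map of clickable text described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: every name here (`Action`, `center`, `run_objective`, `demo`) is hypothetical, the screen capture and the vision model are replaced by plain callables, and no real mouse or keyboard is touched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical and for illustration only; they are not
# the Self-Operating Computer API.

Box = Tuple[int, int, int, int]  # left, top, width, height in pixels


@dataclass
class Action:
    kind: str         # "click", "type", or "done"
    target: str = ""  # visible text to click, or text to type


def center(box: Box) -> Tuple[int, int]:
    """Return the pixel at the middle of a bounding box."""
    left, top, width, height = box
    return (left + width // 2, top + height // 2)


def run_objective(
    objective: str,
    capture: Callable[[], Dict[str, Box]],
    decide: Callable[[str, Dict[str, Box], List[Action]], Action],
    click: Callable[[int, int], None],
    type_text: Callable[[str], None],
    max_steps: int = 10,
) -> List[Action]:
    """Loop: observe the screen, ask the model for one action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        # Stand-in for "take a screenshot and run text recognition":
        # a map from visible text to its bounding box.
        elements = capture()
        action = decide(objective, elements, history)
        history.append(action)
        if action.kind == "done":
            break
        if action.kind == "click" and action.target in elements:
            click(*center(elements[action.target]))
        elif action.kind == "type":
            type_text(action.target)
    return history


def demo() -> list:
    """Drive the loop with a fake screen and a scripted 'model'."""
    fake_screen = {"Search": (100, 40, 200, 20)}
    script = iter([
        Action("click", "Search"),
        Action("type", "weather in London"),
        Action("done"),
    ])
    log: list = []
    run_objective(
        "search for the weather in London",
        capture=lambda: fake_screen,
        decide=lambda objective, elements, history: next(script),
        click=lambda x, y: log.append(("click", x, y)),
        type_text=lambda text: log.append(("type", text)),
    )
    return log


print(demo())
```

In a real setup the `capture` and `decide` callables would wrap a screenshot library and a vision model call, and `click` and `type_text` would drive the operating system's input. Passing them in as parameters keeps the loop itself testable without a display.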
Verify against the repo before relying on details.