Write browser automation tests using plain English descriptions instead of CSS selectors that break when the page changes
Automate Android or iOS app testing with natural-language instructions without learning platform-specific tooling
Extract structured data from any web page by describing in plain text what information you want
Try browser automation interactively through the Chrome extension without writing any code first
Requires an API key for a vision AI model such as Qwen3-VL or Gemini; self-hosting UI-TARS avoids cloud costs but requires GPU resources.
Midscene is a TypeScript library for automating web browsers, Android devices, and iOS devices with plain-language instructions instead of code that targets specific HTML elements. You describe the step in natural language ("click the login button" or "fill in the username field"), and Midscene works out what to interact with from a screenshot of the screen. The core idea is that visual AI models locate elements in screenshots rather than in the page's HTML structure, so the approach works on anything visible on screen, including web pages, mobile apps, desktop applications, and HTML canvas surfaces.
Supported AI models include Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS, an open-source model from ByteDance that can be self-hosted.
For developers, the library offers three types of API calls: interaction methods for clicking, typing, and navigating; data extraction methods for pulling structured information out of a page; and utility functions such as assertions and element locators. It integrates with the existing browser automation tools Puppeteer and Playwright, and a Bridge Mode controls a desktop browser session without a full automation script. Android support uses ADB, and iOS support uses WebDriverAgent.
A Chrome extension is available for trying out automation without writing any code. YAML is supported as an alternative to JavaScript for writing automation scripts, which may be more accessible for non-developers. A caching system speeds up repeat runs of a script by skipping the AI reasoning step when the page has not changed.
The project is licensed under MIT and maintained by the web infrastructure team at ByteDance. Community SDK ports exist for Python and Java.
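The caching behavior described above can be pictured with a small sketch. This is not Midscene's implementation or API: `PlanCache`, `Planner`, and `LocateResult` are hypothetical names, the "screenshot" is a plain string standing in for image data, and the planner function is a stub standing in for a vision-model call. The sketch only illustrates the idea of reusing an earlier result when the same instruction meets an unchanged screen.

```typescript
// Hypothetical result of locating an element: coordinates to act on.
interface LocateResult {
  x: number;
  y: number;
}

// Hypothetical stand-in for an expensive vision-model call.
type Planner = (instruction: string, screenshot: string) => LocateResult;

// Illustrative cache: remembers results keyed by instruction + screen state.
class PlanCache {
  private store = new Map<string, LocateResult>();
  private planner: Planner;
  // Counts how often the (stubbed) model was actually invoked.
  public modelCalls = 0;

  constructor(planner: Planner) {
    this.planner = planner;
  }

  // Build a key from both inputs; a changed screen yields a different key.
  private key(instruction: string, screenshot: string): string {
    return `${instruction}\u0000${screenshot}`;
  }

  locate(instruction: string, screenshot: string): LocateResult {
    const k = this.key(instruction, screenshot);
    const hit = this.store.get(k);
    if (hit !== undefined) {
      // Same instruction, same screen: skip the model call.
      return hit;
    }
    this.modelCalls += 1;
    const result = this.planner(instruction, screenshot);
    this.store.set(k, result);
    return result;
  }
}

// Usage with a stub planner that always "finds" the same point.
const stubPlanner: Planner = () => ({ x: 10, y: 20 });
const cache = new PlanCache(stubPlanner);

cache.locate("click the login button", "screen-v1"); // model call
cache.locate("click the login button", "screen-v1"); // served from cache
cache.locate("click the login button", "screen-v2"); // screen changed, model call
console.log(cache.modelCalls); // 2
```

A real system would need a more tolerant notion of "unchanged" than exact equality, since two screenshots of the same page rarely match byte for byte; how Midscene decides this should be checked in its documentation.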
Verify against the repo before relying on details.