explaingit

web-infra-dev/midscene

13,011 · TypeScript · Audience · developer · Complexity · 3/5 · License · Setup · moderate

TLDR

A TypeScript library from ByteDance that automates web browsers and mobile apps using plain-language instructions: you describe what to do in natural language, and it finds the right element by looking at a screenshot.

Mindmap

mindmap
  root((midscene))
    What it does
      Natural-language automation
      Screenshot-based AI
      Cross-platform
    Platforms
      Web browsers
      Android via ADB
      iOS via WebDriverAgent
    Integrations
      Puppeteer
      Playwright
      Chrome extension
    Use Cases
      UI testing
      Data extraction
      Form automation

Things people build with this

USE CASE 1

Write browser automation tests using plain English descriptions instead of CSS selectors that break when the page changes

USE CASE 2

Automate Android or iOS app testing with natural-language instructions without learning platform-specific tooling

USE CASE 3

Extract structured data from any web page by describing in plain text what information you want

USE CASE 4

Try browser automation interactively through the Chrome extension without writing any code first

Tech stack

TypeScript · JavaScript · Puppeteer · Playwright · Python · Java

Getting it running

Difficulty · moderate · Time to first run · 30 min

Requires an API key for a vision AI model such as Qwen3-VL or Gemini. Self-hosting UI-TARS avoids cloud costs but requires GPU resources.

MIT: use, modify, and share freely for any purpose, including commercially, provided the copyright and license notice is kept.

In plain English

Midscene is a TypeScript library that lets you automate web browsers, Android devices, and iOS devices using plain-language instructions instead of code that points to specific HTML elements. You describe what you want to do in natural language ("click the login button" or "fill in the username field"), and Midscene figures out what to interact with by looking at a screenshot of the screen.

The core idea is that it uses visual AI models to locate elements from screenshots rather than reading the page's HTML structure. This approach works on anything visible on screen, including web pages, mobile apps, desktop applications, and HTML canvas surfaces. Supported AI models include Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS, an open-source model from ByteDance that can be self-hosted.

For developers, the library offers three types of API calls: interaction methods for clicking, typing, and navigating; data extraction methods for pulling structured information out of a page; and utility functions such as assertions and element locators. It integrates with the existing browser automation tools Puppeteer and Playwright, and it also has a Bridge Mode for controlling a desktop browser session without writing a full automation script. Android support uses ADB, and iOS support uses WebDriverAgent. A Chrome extension is available for trying out automation without writing any code.

YAML is supported as an alternative to JavaScript for writing automation scripts, which may be more accessible for non-developers. A caching system makes scripts replay faster on subsequent runs by skipping the AI reasoning step when the page has not changed. The project is licensed under MIT and maintained by the web infrastructure team at ByteDance. Community SDK ports exist for Python and Java.
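To make the three kinds of API call more concrete, here is a minimal TypeScript sketch of an interact, extract, assert flow. It does not import Midscene itself: the `AiAgent` interface is a hypothetical stand-in, and the method names `aiAction`, `aiQuery`, and `aiAssert` are assumptions about the agent's surface that should be checked against the Midscene documentation. `Product`, `searchAndCollect`, and `RecordingAgent` are illustrative names only.

```typescript
// Hypothetical minimal surface of a Midscene-style agent.
// Method names are assumptions; verify them against the official docs.
interface AiAgent {
  aiAction(instruction: string): Promise<void>; // interaction
  aiQuery<T>(dataDescription: string): Promise<T>; // data extraction
  aiAssert(assertion: string): Promise<void>; // utility: assertion
}

// Illustrative shape for the extracted data.
interface Product {
  name: string;
  price: number;
}

// Interact with the page, extract structured data, then assert on what is
// visible. Every step is described in plain language, not with selectors.
async function searchAndCollect(
  agent: AiAgent,
  keyword: string,
): Promise<Product[]> {
  await agent.aiAction(`type "${keyword}" in the search box and press Enter`);
  const products = await agent.aiQuery<Product[]>(
    "{name: string, price: number}[], the products shown in the result list",
  );
  await agent.aiAssert("the result list shows at least one product");
  return products;
}

// Stub agent that records prompts and returns canned data, so the flow can
// be exercised without a browser, a screenshot, or a model API key.
class RecordingAgent implements AiAgent {
  readonly calls: string[] = [];
  private readonly cannedData: unknown;

  constructor(cannedData: unknown) {
    this.cannedData = cannedData;
  }

  async aiAction(instruction: string): Promise<void> {
    this.calls.push(`action: ${instruction}`);
  }

  async aiQuery<T>(dataDescription: string): Promise<T> {
    this.calls.push(`query: ${dataDescription}`);
    return this.cannedData as T;
  }

  async aiAssert(assertion: string): Promise<void> {
    this.calls.push(`assert: ${assertion}`);
  }
}
```

In a real script the agent would presumably be created by wrapping a Playwright or Puppeteer page rather than using a stub; the exact import path and constructor should be taken from the Midscene documentation. Coding against a small interface like this also keeps the flow testable without spending model calls.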

Copy-paste prompts

Prompt 1
Write a Midscene Playwright script that logs into a website by describing the username field, password field, and submit button in natural language
Prompt 2
How do I use Midscene to extract a table of product names and prices from an e-commerce page into a JSON object?
Prompt 3
Set up Midscene with the self-hosted UI-TARS model so screenshots stay on my own machine instead of going to a cloud API
Prompt 4
Write a Midscene YAML automation script that fills a contact form, submits it, and asserts the confirmation message appears
Prompt 5
How do I enable Midscene's caching feature so repeated test runs skip the AI reasoning step when the page has not changed?

Verify against the repo before relying on details.