explaingit

yeahhe365/webdroid-agent

11 · TypeScript · Audience · developer · Complexity · 4/5 · Active · License · Setup · moderate

TLDR

Browser-only React app that drives a USB-connected Android phone with a vision LLM: it takes a screenshot, asks the model for one JSON action per turn, and runs that action through WebADB.

Mindmap

mindmap
  root((WebDroid-Agent))
    Inputs
      Android phone over USB
      OpenAI-compatible vision model
      User task prompt
    Outputs
      ADB actions
      Action log
      Screenshots
    Use Cases
      Test vision agent on a real phone
      Automate Android UI flows
      Demo browser-only ADB control
    Tech Stack
      React
      TypeScript
      Vite
      WebADB

Things people build with this

USE CASE 1

Drive an Android phone with a vision LLM straight from a Chrome tab without installing anything server-side

USE CASE 2

Prototype mobile UI automation tasks like opening Settings or Wi-Fi from natural language

USE CASE 3

Test how well a vision model parses Android screenshots and emits tap and swipe actions

USE CASE 4

Demo a browser-only WebADB and WebUSB flow that needs no backend

Tech stack

React · TypeScript · Vite · WebADB · WebUSB

Getting it running

Difficulty · moderate · Time to first run · 30 min

You need a Chromium-based browser with WebUSB support, an Android phone with USB debugging enabled, and access to an OpenAI-compatible vision model endpoint.
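
The browser requirement can be checked up front. The sketch below is a minimal feature test, not code from the repo: `supportsWebUsb` is a hypothetical helper name, and the only real API it relies on is the `usb` property that WebUSB-capable browsers expose on `navigator`.

```typescript
// Hypothetical helper: reports whether a navigator-like object exposes WebUSB.
// Taking the object as a parameter keeps the check usable outside a browser.
function supportsWebUsb(nav: object | undefined): boolean {
  return nav !== undefined && "usb" in nav;
}

// Read the global navigator defensively so this also runs where none exists.
const globalNavigator = (globalThis as { navigator?: object }).navigator;

console.log(
  supportsWebUsb(globalNavigator)
    ? "WebUSB available"
    : "WebUSB not available in this environment",
);
```

A check like this only confirms that the API exists; the phone still has to be authorized for USB debugging before any ADB connection can succeed.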

The MIT license lets you use, modify, and ship this in commercial or closed products as long as you keep the copyright notice.

In plain English

WebDroid Agent is a browser-only experiment for driving a real Android phone with a vision-capable language model. The whole app runs as a static frontend with no server. It opens a USB connection to a phone through WebUSB and WebADB, takes screenshots, sends them to an OpenAI-compatible chat completions endpoint, then runs the action the model returns through ADB. The author treats it as a quick way to test the vision-model-plus-phone loop, not a long-running phone assistant.

The flow is simple. The user opens the app in Chrome or Edge, connects an Android device with USB debugging turned on, fills in the OpenAI-compatible base URL, API key, and model name, and types a task like "open Settings and go to the Wi-Fi page". The app takes a screenshot, sends it to the model, parses the reply, validates it, and either runs the action or waits for confirmation. The loop repeats until the model returns done or asks the human to take over, the run hits the maximum step count, or the user stops it.

The model is asked to reply with one JSON object per turn. The standard action set includes launch, tap, swipe, input_text, key, back, home, long_press, double_tap, wait, take_over, interact, note, call_api, and done. The parser also accepts Open-AutoGLM style names with a 0 to 1000 relative coordinate space, as well as function-style outputs. Pixel coordinates are mapped back to the device's native resolution before execution.

Safety checks happen entirely in the browser. Model output must parse into a supported action, coordinates are bounds-checked, text inputs are length-limited and reject control characters, the run has a step cap, sensitive taps can require human confirmation, and the user can stop at any time. The author warns against using it for logins, payments, deletions, account settings, or anything that needs a password or verification code.

The stack is React 19, TypeScript 6, and Vite 8, with WebADB based on the Tango library. A live demo is hosted on Cloudflare Pages. Settings, including the API key, are stored in browser localStorage. The project is MIT licensed.
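
The parse, validate, and map steps described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's parser: the field names (`action`, `x`, `y`, `text`), the three-action subset, the `MAX_TEXT_LENGTH` limit, and the helper names are all hypothetical, and `tap` is assumed to use the 0 to 1000 relative space.

```typescript
// Hypothetical action shapes; the real schema is not given in this summary.
type Action =
  | { action: "tap"; x: number; y: number }
  | { action: "input_text"; text: string }
  | { action: "done" };

interface Screen { width: number; height: number }

const REL_MAX = 1000;        // relative coordinate space mentioned above
const MAX_TEXT_LENGTH = 200; // assumed limit, chosen for illustration

// True when v is a finite number inside the relative coordinate space.
function inRange(v: number): boolean {
  return Number.isFinite(v) && v >= 0 && v <= REL_MAX;
}

// Parse one model reply; throw unless it is a supported, in-bounds action.
function parseAction(reply: string): Action {
  let raw: unknown;
  try {
    raw = JSON.parse(reply);
  } catch {
    throw new Error("reply is not valid JSON");
  }
  if (typeof raw !== "object" || raw === null) {
    throw new Error("reply is not a JSON object");
  }
  const obj = raw as Record<string, unknown>;
  switch (obj.action) {
    case "tap": {
      const { x, y } = obj;
      if (typeof x !== "number" || typeof y !== "number") {
        throw new Error("tap needs numeric x and y");
      }
      if (!inRange(x) || !inRange(y)) {
        throw new Error("tap coordinates out of bounds");
      }
      return { action: "tap", x, y };
    }
    case "input_text": {
      const { text } = obj;
      if (typeof text !== "string") throw new Error("input_text needs a string");
      if (text.length > MAX_TEXT_LENGTH) throw new Error("text too long");
      if (/[\u0000-\u001f\u007f]/.test(text)) {
        throw new Error("text contains control characters");
      }
      return { action: "input_text", text };
    }
    case "done":
      return { action: "done" };
    default:
      throw new Error(`unsupported action: ${String(obj.action)}`);
  }
}

// Map a relative coordinate to a device pixel, clamped to the last valid pixel.
function toPixels(x: number, y: number, screen: Screen): { x: number; y: number } {
  return {
    x: Math.min(screen.width - 1, Math.round((x / REL_MAX) * screen.width)),
    y: Math.min(screen.height - 1, Math.round((y / REL_MAX) * screen.height)),
  };
}

const parsed = parseAction('{"action":"tap","x":500,"y":250}');
if (parsed.action === "tap") {
  // On a 1080x2400 screen this maps to pixel (540, 600).
  console.log(toPixels(parsed.x, parsed.y, { width: 1080, height: 2400 }));
}
```

Rejecting a reply outright, rather than trying to repair it, keeps the failure visible in the action log and leaves the decision to retry or stop with the loop or the user.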

Copy-paste prompts

Prompt 1
Walk me through enabling USB debugging on an Android phone and connecting it to WebDroid Agent in Chrome
Prompt 2
Set up WebDroid Agent against my local Ollama vision model exposed as an OpenAI-compatible endpoint
Prompt 3
Explain the JSON action schema WebDroid Agent expects and add a new custom action for screen rotate
Prompt 4
Audit the safety checks in WebDroid Agent and suggest extra guards before allowing taps on payment screens
Prompt 5
Compare WebDroid Agent to Appium for vision-driven Android automation in a CI pipeline

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.