Drive an Android phone with a vision LLM straight from a Chrome tab without installing anything server-side
Prototype mobile UI automation tasks like opening Settings or Wi-Fi from natural language
Test how well a vision model parses Android screenshots and emits tap and swipe actions
Demo a browser-only WebADB and WebUSB flow that needs no backend
Requires a Chromium-based browser with WebUSB support, an Android phone with USB debugging enabled, and access to an OpenAI-compatible vision model endpoint.
WebDroid Agent is a browser-only experiment for driving a real Android phone with a vision-capable language model. The whole app runs as a static frontend with no server, opens a USB connection to a phone through WebUSB and WebADB, takes screenshots, sends them to an OpenAI-compatible chat completions endpoint, then runs the action the model returns through ADB. The author treats it as a quick way to test the vision-model-plus-phone loop, not a long-running phone assistant.

The flow is simple. The user opens the app in Chrome or Edge, connects an Android device with USB debugging turned on, fills in the OpenAI-compatible base URL, API key, and model name, and types a task like "open Settings and go to the Wi-Fi page". The app takes a screenshot, sends it to the model, parses the reply, validates it, and either runs the action or waits for confirmation. The loop repeats until the model returns done, asks the human to take over, hits the maximum step count, or the user stops it.

The model is asked to reply with one JSON object per turn. The standard action set includes launch, tap, swipe, input_text, key, back, home, long_press, double_tap, wait, take_over, interact, note, call_api, and done. The parser also accepts Open-AutoGLM style names with a 0 to 1000 relative coordinate space and function-style outputs. Pixel coordinates are mapped back to the device's native resolution before execution.

Safety checks happen entirely in the browser. Model output must parse into a supported action, coordinates are bounds-checked, text inputs are length-limited and reject control characters, the run has a step cap, sensitive taps can require human confirmation, and the user can stop at any time. The author warns against using it for logins, payments, deletions, account settings, or anything that needs a password or verification code.

The stack is React 19, TypeScript 6, and Vite 8, with WebADB based on the Tango library. A live demo is hosted on Cloudflare Pages.
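The parse-and-validate step described above can be sketched roughly as follows. This is an illustration only, not the project's code: the `Action` type, the `validateAction` and `toDevicePixels` helpers, the 200-character text limit, and the choice to treat tap coordinates as 0 to 1000 relative values are all assumptions, and only three of the listed actions are covered.

```typescript
// Hypothetical action shapes; the project's real types may differ.
type Action =
  | { action: "tap"; x: number; y: number }
  | { action: "input_text"; text: string }
  | { action: "done" };

interface ScreenSize {
  width: number;
  height: number;
}

const MAX_TEXT_LENGTH = 200; // assumed limit, not taken from the project

// True when v is a finite number inside the 0..1000 relative space.
function inRange(v: number): boolean {
  return Number.isFinite(v) && v >= 0 && v <= 1000;
}

// Map a 0..1000 relative coordinate to a device pixel index.
function toDevicePixels(rel: number, size: number): number {
  const px = Math.round((rel / 1000) * size);
  return Math.min(size - 1, Math.max(0, px));
}

// Parse one model reply and reject anything that is not a supported action.
function validateAction(raw: string, screen: ScreenSize): Action {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("model output is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("model output is not a JSON object");
  }
  const obj = parsed as Record<string, unknown>;

  switch (obj.action) {
    case "tap": {
      const { x, y } = obj;
      if (typeof x !== "number" || typeof y !== "number" || !inRange(x) || !inRange(y)) {
        throw new Error("tap coordinates out of bounds");
      }
      return {
        action: "tap",
        x: toDevicePixels(x, screen.width),
        y: toDevicePixels(y, screen.height),
      };
    }
    case "input_text": {
      const text = obj.text;
      if (typeof text !== "string" || text.length > MAX_TEXT_LENGTH) {
        throw new Error("text missing or too long");
      }
      // Reject ASCII control characters, including newlines and tabs.
      if (/[\u0000-\u001f\u007f]/.test(text)) {
        throw new Error("text contains control characters");
      }
      return { action: "input_text", text };
    }
    case "done":
      return { action: "done" };
    default:
      throw new Error(`unsupported action: ${String(obj.action)}`);
  }
}
```

Throwing on any unrecognized or malformed reply keeps the failure mode simple: a caller would stop or re-prompt rather than send a half-understood action to the phone.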
Settings including the API key are stored in browser localStorage. The project is MIT licensed.
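For orientation, the three settings (base URL, API key, model name) plus a screenshot are enough to build a request in the widely used OpenAI-compatible chat completions format. The sketch below is an assumption about how such a request might look, not the project's actual code: `AgentSettings`, `buildVisionRequest`, the system prompt text, and the convention that the base URL already ends in a version segment such as `/v1` are all hypothetical.

```typescript
// Hypothetical settings shape mirroring the three fields the user fills in.
interface AgentSettings {
  baseUrl: string;
  apiKey: string;
  model: string;
}

interface VisionRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a chat completions request carrying the task text and one PNG screenshot.
function buildVisionRequest(
  settings: AgentSettings,
  task: string,
  screenshotBase64: string,
): VisionRequest {
  // Assumes baseUrl already includes the API version path, e.g. ".../v1".
  const url = `${settings.baseUrl.replace(/\/+$/, "")}/chat/completions`;

  const body = {
    model: settings.model,
    messages: [
      {
        role: "system",
        // Illustrative prompt only; the project's real prompt is not shown in the source.
        content: "Reply with exactly one JSON object describing the next action.",
      },
      {
        role: "user",
        content: [
          { type: "text", text: task },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
  };

  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${settings.apiKey}`,
      },
      body: JSON.stringify(body),
    },
  };
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)` in the browser. Because the key travels in a request header from the page itself and sits in localStorage between sessions, a short-lived or low-privilege key is the safer choice for experiments like this.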
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.