explaingit

zai-org/open-autoglm

25,311 | Python | Audience · developer | Complexity · 4/5 | Maintained | License | Setup · hard

TLDR

Open-source AI phone agent that automates tasks on Android, HarmonyOS, and iOS devices by interpreting screenshots and executing taps, swipes, and text input from plain-language instructions.

Mindmap

mindmap
  root((repo))
    What it does
      Automates phone tasks
      Understands screenshots
      Executes actions
    How it works
      Connects via ADB/HDC
      Vision-language model
      Plans and executes
    Supported platforms
      Android devices
      HarmonyOS phones
      iOS with setup
    Use cases
      Repetitive automation
      App testing
      Agent research
    Tech stack
      Python framework
      vLLM inference
      SGLang support

Things people build with this

USE CASE 1

Automate repetitive smartphone tasks like opening apps and searching without manual taps.

USE CASE 2

Test mobile apps by having the AI interact with screens and verify expected behavior.

USE CASE 3

Research and develop AI phone agents by experimenting with vision-language models on real devices.

Tech stack

Python · AutoGLM-Phone-9B · vLLM · SGLang · ADB · HDC · WebDriverAgent

Getting it running

Difficulty · hard | Time to first run · 1 day+

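Requires access to the AutoGLM-Phone-9B model, either self-hosted with vLLM or SGLang (a CUDA-capable GPU is likely needed) or reached through a third-party API service. It also requires a device connection: ADB for Android, HDC for HarmonyOS, and WebDriverAgent for iOS. Expect several moving parts.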
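Before touching the model side, it helps to confirm that a device is actually reachable. The following is a minimal sketch, not code from the repository: it parses the text printed by the standard `adb devices` command and keeps only devices in the `device` state. The function names `parse_adb_devices` and `ready_devices` are hypothetical, and HarmonyOS (HDC) and iOS (WebDriverAgent) would need their own checks.

```python
import shutil
import subprocess


def parse_adb_devices(output: str) -> list[tuple[str, str]]:
    """Parse the text printed by `adb devices` into (serial, state) pairs."""
    devices = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the "List of devices attached" header,
        # and daemon notices, which start with "*".
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices.append((parts[0], parts[1]))
    return devices


def ready_devices() -> list[str]:
    """Return serials of devices in the `device` state (requires adb on PATH)."""
    if shutil.which("adb") is None:
        raise RuntimeError("adb not found on PATH; install Android platform-tools")
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return [
        serial
        for serial, state in parse_adb_devices(result.stdout)
        if state == "device"
    ]
```

A device listed as `unauthorized` usually means the USB debugging prompt on the phone has not been accepted yet, so it is filtered out here rather than treated as ready.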

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license includes a patent grant.

In plain English

Open-AutoGLM is an open-source AI phone agent framework built on the AutoGLM model. The problem it solves is that most smartphone tasks still require you to tap through menus yourself. This project lets you describe a task in plain language, such as "open Meituan and search for nearby hotpot restaurants", and the AI automatically figures out what to do on your phone's screen and does it for you.

The system works by connecting to your Android or HarmonyOS phone via ADB (Android Debug Bridge, a standard developer tool for communicating with Android devices) or HDC (the equivalent for Huawei HarmonyOS). It takes screenshots of your screen, uses a vision-language model to understand what is shown, plans a sequence of actions, and then executes taps, swipes, and text input on your behalf. It includes a confirmation step for sensitive actions and supports remote control over Wi-Fi.

The AI model (AutoGLM-Phone-9B) can be run via third-party API services or self-hosted using the vLLM or SGLang inference frameworks.

You would use this for automating repetitive phone tasks, app testing, or research into AI phone agents. The framework supports iOS as well as Android, though iOS setup requires additional configuration via WebDriverAgent. It is written in Python and intended for research and educational use.
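The screenshot, plan, execute cycle described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the project's actual implementation: the `Action` type, its `sensitive` flag, and the `run_agent`, `adb_command`, and `adb_screenshot` names are all hypothetical, and the planner is passed in as a plain callable standing in for the vision-language model. The only real interfaces used are the standard `adb exec-out screencap -p` and `adb shell input tap|swipe|text` commands.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    kind: str                # "tap", "swipe", "text", or "done"
    args: tuple = ()
    sensitive: bool = False  # hypothetical flag a planner could set


def adb_command(action: Action, serial: Optional[str] = None) -> list[str]:
    """Translate an Action into an argument list for the adb CLI."""
    base = ["adb"] + (["-s", serial] if serial else [])
    if action.kind == "tap":
        x, y = action.args
        return base + ["shell", "input", "tap", str(x), str(y)]
    if action.kind == "swipe":
        x1, y1, x2, y2, ms = action.args
        coords = [str(v) for v in (x1, y1, x2, y2, ms)]
        return base + ["shell", "input", "swipe"] + coords
    if action.kind == "text":
        (text,) = action.args
        # `input text` expects spaces as %s; other shell metacharacters
        # are not escaped in this sketch.
        return base + ["shell", "input", "text", text.replace(" ", "%s")]
    raise ValueError(f"unsupported action kind: {action.kind}")


def adb_screenshot(serial: Optional[str] = None) -> bytes:
    """Capture the current screen as PNG bytes."""
    base = ["adb"] + (["-s", serial] if serial else [])
    result = subprocess.run(
        base + ["exec-out", "screencap", "-p"], capture_output=True, check=True
    )
    return result.stdout


def run_agent(
    task: str,
    plan_next: Callable[[str, bytes], Action],  # stands in for the model
    take_screenshot: Callable[[], bytes],
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    max_steps: int = 20,
) -> int:
    """Loop: observe, plan, confirm if sensitive, act. Returns actions executed."""
    executed = 0
    for _ in range(max_steps):
        action = plan_next(task, take_screenshot())
        if action.kind == "done":
            break
        if action.sensitive and not confirm(action):
            break  # user declined a sensitive action; stop the run
        execute(action)
        executed += 1
    return executed
```

On a real device, `execute` could be something like `lambda a: subprocess.run(adb_command(a), check=True)` and `take_screenshot` could be `adb_screenshot`. Passing these in as callables keeps the loop testable without a phone and makes it straightforward to swap ADB for HDC or WebDriverAgent. The `max_steps` cap is a design choice of this sketch so that a confused planner cannot loop forever.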

Copy-paste prompts

Prompt 1
How do I set up Open-AutoGLM to control my Android phone and automate a task like opening an app and searching for something?
Prompt 2
Show me how to use the AutoGLM-Phone-9B model with vLLM to process screenshots and generate phone actions.
Prompt 3
What's the difference between using ADB for Android and HDC for HarmonyOS in Open-AutoGLM, and how do I configure each?
Prompt 4
How can I add a confirmation step for sensitive actions in Open-AutoGLM before the AI executes them on my phone?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.