bytedance/ui-tars

★ 10,527PythonAudience · researcherComplexity · 4/5Setup · hard

Mindmap

mindmap
  root((UI-TARS))
    What it does
      Understands screens
      Clicks and types
      Multi-step reasoning
    Supported Platforms
      Windows macOS Linux
      Android mobile
      Web browsers
    Tech Stack
      Python
      Vision language model
      Reinforcement learning
    Use Cases
      Task automation
      UI testing
      Research benchmarks

mindmap root((UI-TARS)) What it does Understands screens Clicks and types Multi-step reasoning Supported Platforms Windows macOS Linux Android mobile Web browsers Tech Stack Python Vision language model Reinforcement learning Use Cases Task automation UI testing Research benchmarks

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Automate repetitive desktop or browser tasks by describing them in plain English instead of writing step-by-step code

USE CASE 2

Test web or mobile app interfaces automatically by having the agent navigate and complete tasks from a goal description

USE CASE 3

Build a computer-use assistant that operates a real desktop to fill forms, search the web, or manage files

USE CASE 4

Research and benchmark AI agent performance on standardized GUI automation tests for browser and desktop tasks

Tech stack

PythonPyTorchHugging Face

Getting it running

Difficulty · hard Time to first run · 1day+

Requires deploying a large vision-language model via Hugging Face before you can call it, no lightweight local option described.

In plain English

UI-TARS is an AI agent from ByteDance that can look at a computer screen or phone screen and perform actions on it, just as a human would by clicking, typing, scrolling, and navigating. The model is trained to understand what it sees visually and then figure out what actions to take to complete a given task. It can operate on desktop operating systems like Windows, macOS, and Linux, on mobile devices and Android emulators, and inside web browsers. The core idea is that instead of writing code to automate a specific task, you give the agent a goal in plain language and it works out the steps itself. The model can reason through a problem before taking action, which makes it more capable on tasks that require multiple steps or where the right path is not obvious from the start. Version 1.5 is built on a vision-language model combined with reinforcement learning training, which is how it develops that reasoning ability. Version 2, also called UI-TARS-2, extends the same approach to cover games, code tasks, and tool use on top of the original GUI capabilities. To use it, you deploy the model (the repository links to hosting options via Hugging Face) and then call it with a screenshot of the current screen along with a goal. The model returns a description of what action to take, such as clicking at a specific coordinate or typing a word. A post-processing library called ui-tars converts that output into executable code for controlling the mouse and keyboard. There is also a desktop application version in a separate repository for people who want to run the agent on their own machine without setting up the full deployment stack. Benchmark results show it performing competitively against other AI computer-use systems on standardized tests for browser automation and desktop task completion.

Copy-paste prompts

Prompt 1

I've deployed UI-TARS from Hugging Face. Write Python code to send it a screenshot of my browser with the goal 'Find the cheapest flight from London to Paris next week' and execute the returned action using the ui-tars library.

Prompt 2

Using UI-TARS, build a loop that takes a screenshot every 2 seconds, sends it to the model with a task goal, parses the action response, and moves the mouse or types accordingly.

Prompt 3

What is the difference between UI-TARS 1.5 and UI-TARS-2 in terms of supported task types? Which version should I use for pure browser automation versus coding tasks?

Prompt 4

Show me how to point UI-TARS at an Android emulator and have it complete a multi-step task inside a mobile app given a natural language goal.

Open on GitHub → Explain another repo

← bytedance on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.