Speed up a GUI agent that navigates Android or desktop apps by cutting its memory usage mid-task without any retraining.
Reproduce the ScreenSpot-Pro benchmark result showing 40.9% accuracy at only 20% of full cache size.
Evaluate how accurately a vision-language model can find and interact with on-screen interface elements.
Apply STaR-KV's cache-selection scoring to a UI-TARS or Qwen2.5-VL model to reduce GPU memory pressure in long multi-step tasks.
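To get a feel for the memory pressure involved, the following is a back-of-the-envelope estimate of KV cache size at a full and a 20% budget. The layer count, KV head count, head dimension, and token count are illustrative assumptions rather than values read from the repository; check the target model's config before relying on them.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes needed to store keys and values for seq_len cached tokens."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed, illustrative configuration (not taken from the repository).
cfg = dict(num_layers=28, num_kv_heads=4, head_dim=128, dtype_bytes=2)
full_tokens = 20_000   # hypothetical length of a long multi-step task
budget = 0.2           # fraction of the cache that is kept

full = kv_cache_bytes(full_tokens, **cfg)
kept = kv_cache_bytes(round(full_tokens * budget), **cfg)
print(f"full: {full / 2**20:.2f} MiB, kept: {kept / 2**20:.2f} MiB")
```

Because the estimate is linear in the number of cached tokens, keeping 20% of the entries cuts the cache footprint to roughly one fifth; model weights and activations are not included.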
Requires a CUDA-capable GPU, a specific Hugging Face Transformers version, and FlashAttention-2. Model weights and benchmark datasets must be downloaded separately and paths set via environment variables.
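A minimal sketch of resolving model and dataset paths from environment variables might look like the following. The variable names STARKV_MODEL_DIR and STARKV_DATA_DIR and the default paths are hypothetical placeholders, not the names the repository uses; consult its README for the real ones.

```python
import os
from pathlib import Path

def resolve_path(var_name, default):
    """Return the path held in an environment variable, or a default."""
    return Path(os.environ.get(var_name, default)).expanduser()

# Hypothetical variable names and defaults, for illustration only.
model_dir = resolve_path("STARKV_MODEL_DIR", "~/models/UI-TARS-1.5-7B")
data_dir = resolve_path("STARKV_DATA_DIR", "~/datasets/ScreenSpot-Pro")

for name, path in [("model", model_dir), ("dataset", data_dir)]:
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```

Checking the paths up front makes a missing download show up before a long evaluation run starts.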
This repository contains the official code for STaR-KV, a research method for making AI models that operate computer interfaces run more efficiently. These models, sometimes called GUI agents, look at screenshots of a computer screen and decide what to click or type. As they process each screenshot, they build up an internal store of data called the KV cache. Over a multi-step task this cache grows large enough to slow the model down and consume more GPU memory.

STaR-KV addresses this by compressing the KV cache during each inference run, without retraining the model or changing its weights. It selects which cached entries are worth keeping by applying three scoring adjustments. The first estimates which parts of the screen image are genuinely informative, based on how the visual data is distributed internally. The second discounts entries that represent redundant or outdated history from earlier steps. The third adjusts how sharply the scores are separated before the top entries are chosen for retention. The result is that the model keeps a smaller, better-prioritized slice of its history at each step.

The repository supports two families of models: UI-TARS and Qwen2.5-VL style models on the one hand, and OpenCUA style models on the other. Evaluation scripts are included for several benchmarks that test how accurately a model can locate interface elements or complete navigation tasks on Android and desktop screens. The main reproduction script evaluates the UI-TARS-1.5-7B model on the ScreenSpot-Pro benchmark, where keeping 20% of the full cache achieved an overall accuracy of 40.9%, close to the full-cache baseline.

Setup requires a GPU machine with CUDA, a specific version of the Hugging Face Transformers library, and FlashAttention-2. Model weights and benchmark datasets are downloaded separately. Paths are configured through environment variables, and a local override file is provided for persistent settings.
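The three scoring adjustments described above can be illustrated with a toy selection routine. This is a sketch under stated assumptions, not the repository's implementation: the multiplicative form of the visual and history adjustments, the per-head temperature softmax, and every name and default value below are hypothetical choices made for illustration.

```python
import math

def softmax(xs, temperature):
    """Temperature-scaled softmax; a lower temperature separates scores more."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_cache_entries(head_scores, visual_weight, age,
                         decay=0.8, temperature=0.5, budget=0.2):
    """Return the sorted indices of the cache entries to keep.

    head_scores:   one list of non-negative raw scores per attention head.
    visual_weight: per-entry informativeness in [0, 1] (1.0 for text tokens).
    age:           per-entry number of steps since the entry was written.
    """
    n = len(visual_weight)
    combined = [0.0] * n
    for scores in head_scores:
        # Assumed adjustments 1 and 2: down-weight uninformative image
        # regions and decay entries that come from older steps.
        adjusted = [s * visual_weight[i] * decay ** age[i]
                    for i, s in enumerate(scores)]
        # Assumed adjustment 3: sharpen each head's scores before pooling.
        for i, p in enumerate(softmax(adjusted, temperature)):
            combined[i] += p
    keep = max(1, round(budget * n))
    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:keep]
    return sorted(top)

# Toy example: entry 2 is an uninformative image patch, entry 4 is stale.
kept = select_cache_entries(
    head_scores=[[1.0, 5.0, 5.0, 0.5, 4.0]],
    visual_weight=[1.0, 1.0, 0.1, 1.0, 1.0],
    age=[0, 0, 0, 0, 3],
    decay=0.5, budget=0.4,
)
print(kept)
```

In the toy example, entries 2 and 4 are dropped despite high raw scores, because the visual weight and the age decay push them below entries 0 and 1. With a single head the temperature does not change the ranking; in this sketch it only matters when the sharpened scores of several heads are pooled.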
Verify against the repo before relying on details.