kawhiiiileo/star-kv

★ 14 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((star-kv))
    What It Does
      Compress KV cache
      No retraining needed
      Three scoring steps
    Supported Models
      UI-TARS family
      Qwen2.5-VL family
      OpenCUA family
    Benchmarks
      ScreenSpot-Pro
      Android navigation
      Desktop navigation
    Setup Needs
      GPU with CUDA
      FlashAttention-2
      Hugging Face Transformers


Things people build with this

USE CASE 1

Speed up a GUI agent that navigates Android or desktop apps by cutting its memory usage mid-task without any retraining.

USE CASE 2

Reproduce the ScreenSpot-Pro benchmark result showing 40.9% accuracy at only 20% of full cache size.

USE CASE 3

Evaluate how accurately a vision-language model can find and interact with on-screen interface elements.

USE CASE 4

Apply STaR-KV's cache-selection scoring to a UI-TARS or Qwen2.5-VL model to reduce GPU memory pressure in long multi-step tasks.
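To give a feel for the memory pressure mentioned in the use cases above, here is a rough back-of-the-envelope sketch of KV cache size. The formula (keys plus values, per layer, per KV head) is the standard one for transformer decoders. The layer count, head count, head dimension, and token count are illustrative assumptions, not values read from the repository or from any model card.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes used by cached keys and values for num_tokens positions."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Illustrative configuration (assumed; check your model's config file).
config = dict(num_layers=28, num_kv_heads=4, head_dim=128)
full_tokens = 10_000                      # assumed length of a long multi-step history
kept_tokens = int(full_tokens * 0.20)     # 20% retention budget

full = kv_cache_bytes(full_tokens, **config)
kept = kv_cache_bytes(kept_tokens, **config)
print(f"full cache: {full / 2**20:.1f} MiB, 20% budget: {kept / 2**20:.1f} MiB")
```

Because the cache grows linearly with the number of cached tokens, keeping 20% of the entries cuts this part of the memory footprint to one fifth, whatever the actual model dimensions are.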

Tech stack

Python, PyTorch, CUDA, FlashAttention-2, Hugging Face Transformers, UI-TARS, Qwen2.5-VL

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a CUDA-capable GPU, a specific Hugging Face Transformers version, and FlashAttention-2. Model weights and benchmark datasets must be downloaded separately and paths set via environment variables.
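As a minimal sketch of what path configuration through environment variables can look like, the helper below reads two paths and fails early when one is unset. The variable names `EXAMPLE_MODEL_PATH` and `EXAMPLE_DATASET_ROOT` and the function `resolve_paths` are hypothetical placeholders; the repository defines its own names, so check its README for the real ones.

```python
import os
from pathlib import Path


def resolve_paths(env=os.environ):
    """Read required paths from environment variables and fail early if unset.

    The variable names below are hypothetical placeholders, not the repo's.
    """
    required = {
        "model": "EXAMPLE_MODEL_PATH",       # hypothetical name
        "dataset": "EXAMPLE_DATASET_ROOT",   # hypothetical name
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Unset environment variables: " + ", ".join(missing))
    # Expand "~" so home-relative paths work the same as absolute ones.
    return {key: Path(env[var]).expanduser() for key, var in required.items()}
```

Checking for missing variables before loading anything avoids discovering a bad path only after a long model download or load.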

No license information was mentioned in the explanation.

In plain English

This repository contains the official code for STaR-KV, a research method for making AI models that operate computer interfaces run more efficiently. These models, sometimes called GUI agents, look at screenshots of a computer screen and decide what to click or type. As they process each screenshot, they build up an internal store of data called the KV cache, and over a multi-step task this store grows large enough to slow the model down and consume more memory.

STaR-KV compresses the KV cache at inference time, without retraining the model or changing its weights. It decides which cached entries are worth keeping by applying three scoring adjustments. The first estimates which parts of the screen image are genuinely informative, based on how the visual data is distributed internally. The second discounts entries that represent redundant or outdated history from earlier steps. The third adjusts how sharply the scores are separated before the top-scoring entries are chosen for retention. The result is that the model keeps a smaller, better-prioritized slice of its history at each step.

The repository supports two groups of models: UI-TARS and Qwen2.5-VL style models, and OpenCUA style models. Evaluation scripts are included for several benchmarks that test how accurately a model can locate interface elements or complete navigation tasks on Android and desktop screens. The main reproduction script evaluates the UI-TARS-1.5-7B model on the ScreenSpot-Pro benchmark, where keeping 20% of the full cache achieved an overall accuracy of 40.9%, close to the full-cache baseline.

Setup requires a GPU machine with CUDA, a specific version of the Hugging Face Transformers library, and FlashAttention-2. Model weights and benchmark datasets are downloaded separately. Paths are configured through environment variables, and a local override file is provided for persistent settings.
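The general idea of score-based cache selection can be sketched in a few lines. This is a toy illustration under stated assumptions, not the STaR-KV implementation: the function names, the multiplicative informativeness weight, the exponential age discount, the softmax temperature, and the averaging across heads are all stand-ins chosen for clarity, and the repository's actual formulas may differ.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax; a lower temperature separates scores more sharply."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_kv_indices(head_scores, informativeness, age_steps,
                      keep_ratio=0.2, decay=0.5, temperature=0.5):
    """Return sorted indices of cache entries to keep (toy illustration).

    head_scores     -- one list of raw importance scores per attention head
    informativeness -- per-entry weight in [0, 1] (hypothetical stand-in for step 1)
    age_steps       -- per-entry age in agent steps (hypothetical stand-in for step 2)
    """
    n = len(informativeness)
    combined = [0.0] * n
    for scores in head_scores:
        # Steps 1 and 2: weight by informativeness, then discount older entries.
        adjusted = [s * w * (decay ** a)
                    for s, w, a in zip(scores, informativeness, age_steps)]
        # Step 3: sharpen per head before pooling. On a single head this is
        # monotonic and would not change the ranking by itself.
        for i, p in enumerate(softmax(adjusted, temperature)):
            combined[i] += p / len(head_scores)
    # Keep the highest-scoring entries within the budget (at least one).
    budget = max(1, round(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


# Entry 0 has a high raw score but is three steps old, so it is dropped.
kept = select_kv_indices([[4.0, 4.0, 2.0, 1.0, 1.0]], [1.0] * 5,
                         [3, 0, 0, 0, 0], keep_ratio=0.4)
print(kept)  # [1, 2]
```

The kept indices would then be used to gather the matching rows from each layer's key and value tensors. That gather step depends on the model's cache layout and is left out here.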

Copy-paste prompts

Prompt 1

I'm using a UI-TARS-1.5-7B model to automate screen navigation tasks. Walk me through how to apply STaR-KV cache compression from the kawhiiiileo/star-kv repo so the model uses less GPU memory during inference.

Prompt 2

Explain the three scoring adjustments STaR-KV uses to decide which parts of the KV cache to keep, and why each one matters for GUI agents.

Prompt 3

I want to reproduce the ScreenSpot-Pro benchmark with STaR-KV at 20% cache retention. What environment variables and steps do I need to configure based on the star-kv repository?

Prompt 4

How does STaR-KV differ from simply truncating the KV cache? What makes its selection method smarter for multi-step screen-navigation tasks?

Prompt 5

I have a Qwen2.5-VL model running on a GPU machine with CUDA and FlashAttention-2. Show me how to integrate star-kv's compression into my inference loop.


Verify against the repo before relying on details.