skywalker-yqz/affordancevla

★ 16 · Python · Audience · researcher · Complexity · 5/5 · License · MIT · Setup · hard

Mindmap

mindmap
  root((affordancevla))
    Core idea
      Affordance forecasting
      Target object prediction
      Contact point heat map
      3D object pose
    Architecture
      Understanding module
      Affordance generation
      Action module
    Training stages
      Affordance datasets
      Synthetic robot data
      Benchmark fine-tuning
    Benchmarks
      LIBERO simulation
      CALVIN simulation


Things people build with this

USE CASE 1

Train a robot arm to follow natural language instructions by predicting object affordances before generating movement commands.

USE CASE 2

Benchmark a vision-language-action model on LIBERO or CALVIN simulation environments using the provided training scripts.

USE CASE 3

Annotate a new robot manipulation dataset with affordance labels using the automated annotation pipeline.

Tech stack

Python

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires a GPU and multiple separate Python environments because the full pipeline depends on several incompatible software stacks.

Use freely for any purpose, including commercial use, as long as you keep the copyright notice.

In plain English

AffordanceVLA is a research project from Peking University and HKUST aimed at improving how robot arms understand and execute instructions. The problem it addresses is teaching a robot to take a sentence like "pick up the red cup" and translate it into physical arm movements. Existing systems tend to jump directly from the instruction and camera image to the movement commands, which can make it hard for the robot to reason about exactly which object to interact with, where on that object to make contact, and what the 3D geometry of the manipulation looks like.

The project introduces a middle step called affordance forecasting. Instead of going straight from vision and language to action, the system first predicts three things: which object in the scene is the target, where on that object the robot should make contact (expressed as a heat map over the image), and how the object is positioned in 3D space. These intermediate predictions are called affordances, a term used in robotics for the action possibilities an object presents. Only after building this structured picture does the model generate the actual movement commands.

The architecture is split into three expert components that work in a strict sequence: an understanding module processes the camera image and the instruction, an affordance generation module produces the three affordance predictions, and an action module converts everything into a movement plan. Information flows one way through the chain, so the action module cannot feed back into the affordance stage.

Training happens in three stages: first on affordance datasets from the research community, then on a large synthetic robot dataset, and finally on the specific benchmark the model is being evaluated on (LIBERO or CALVIN, both standard robot simulation environments used in academic comparisons).

The repository includes the model code, training scripts, and an automated pipeline for annotating affordance data. Multiple environment files are provided because the full pipeline depends on several incompatible software stacks that need separate Python environments. The license is MIT.
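To make the one-way, three-stage flow concrete, here is a minimal toy sketch in plain Python. It is not the repository's code: every name in it (`Affordance`, `peak_contact_pixel`, `run_pipeline`, and the `toy_*` stand-ins) is hypothetical, the "image" is just a small grid of numbers, and the six-value pose layout is an assumption chosen for illustration. It only shows the shape of the idea: understanding first, then the three affordance predictions, then an action plan that reads the affordances but never feeds back into them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Heatmap = List[List[float]]  # toy stand-in for a per-pixel contact score map


@dataclass(frozen=True)  # frozen: fields cannot be reassigned by later stages
class Affordance:
    target_object: str               # which object to interact with
    contact_heatmap: Heatmap         # where on the image to make contact
    object_pose: Tuple[float, ...]   # assumed layout: x, y, z, roll, pitch, yaw


def peak_contact_pixel(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (row, col) of the highest-scoring cell in the heat map."""
    _, row, col = max(
        ((value, r, c) for r, line in enumerate(heatmap) for c, value in enumerate(line)),
        key=lambda item: item[0],
    )
    return row, col


def run_pipeline(image, instruction: str,
                 understand: Callable, forecast: Callable, act: Callable):
    """Run the three stages strictly in order; data only flows forward."""
    context = understand(image, instruction)   # stage 1: understanding
    affordance = forecast(context)             # stage 2: affordance forecasting
    return act(context, affordance)            # stage 3: action generation


# Toy stand-ins for the three modules (purely illustrative, no learning involved).
def toy_understand(image, instruction):
    return {"image": image, "instruction": instruction}


def toy_forecast(context) -> Affordance:
    return Affordance(
        target_object=context["instruction"].split()[-1],  # last word as the target
        contact_heatmap=context["image"],                  # reuse the grid as the heat map
        object_pose=(0.0,) * 6,                            # placeholder pose
    )


def toy_act(context, affordance: Affordance):
    row, col = peak_contact_pixel(affordance.contact_heatmap)
    return [("move_to_pixel", row, col), ("grasp", affordance.target_object)]


plan = run_pipeline([[0.1, 0.2], [0.9, 0.3]], "pick up the red cup",
                    toy_understand, toy_forecast, toy_act)
print(plan)
```

In the real project each stage is a learned model component rather than a hand-written function, so treat this only as a picture of the data flow and check the repository for the actual interfaces.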

Copy-paste prompts

Prompt 1

Set up the AffordanceVLA training pipeline on my GPU machine. Walk me through which environment files to install for each stage and why they need to be separate environments.

Prompt 2

Run the affordance annotation pipeline on my own robot demonstration dataset to generate contact point heat maps and 3D object pose predictions.

Prompt 3

Reproduce the AffordanceVLA results on the LIBERO benchmark. Show me the three training stages in order and which dataset each one requires.


Verify against the repo before relying on details.