Train a robot arm to follow natural language instructions by predicting object affordances before generating movement commands.
Benchmark a vision-language-action model on LIBERO or CALVIN simulation environments using the provided training scripts.
Annotate a new robot manipulation dataset with affordance labels using the automated annotation pipeline.
Requires a GPU and multiple separate Python environments because the full pipeline depends on several incompatible software stacks.
AffordanceVLA is a research project from Peking University and HKUST aimed at improving how robot arms understand and execute instructions. The problem it addresses is teaching a robot to take a sentence like "pick up the red cup" and translate it into physical arm movements. Existing systems tend to jump directly from the instruction and camera image to the movement commands, which makes it hard for the robot to reason about which object to interact with, where on that object to make contact, and what the 3D geometry of the manipulation looks like.
The project introduces a middle step called affordance forecasting. Instead of going straight from vision and language to action, the system first predicts three things: which object in the scene is the target, where on that object the robot should make contact (expressed as a heat map over the image), and how the object is positioned in 3D space. These intermediate predictions are called affordances, a term used in robotics for the action possibilities an object presents. Only after building this structured picture does the model generate the actual movement commands.
The architecture is split into three expert components that run in a strict sequence: an understanding module processes the camera image and the instruction, an affordance generation module produces the three affordance predictions, and an action module converts everything into a movement plan. Information flows one way through the chain, so the action module cannot feed back into the affordance stage.
Training happens in three stages: first on affordance datasets from the research community, then on a large synthetic robot dataset, and finally on the specific benchmark the model is being evaluated on (LIBERO or CALVIN, both standard robot simulation environments used in academic comparisons). The repository includes the model code, training scripts, and an automated pipeline for annotating affordance data.
Multiple environment files are provided because the full pipeline depends on several incompatible software stacks that need separate Python environments. The license is MIT.
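The one-way, three-stage chain described above can be illustrated with a small toy sketch. This is not the repository's code: every name here (Context, Affordance, understand, forecast_affordance, plan_action, run_pipeline) is hypothetical, and the "models" are trivial stand-ins chosen only to show the data flow from understanding to affordance to action.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Context:
    """Output of the hypothetical understanding stage."""
    instruction: str
    image: List[List[float]]  # toy grayscale image as rows of pixel values


@dataclass(frozen=True)
class Affordance:
    """The three intermediate predictions named in the description."""
    target_object: str                   # which object to act on
    contact_heatmap: List[List[float]]   # per-pixel contact likelihood
    pose_3d: Tuple[float, float, float]  # placeholder 3D position


def understand(image: List[List[float]], instruction: str) -> Context:
    # Stage 1 (assumed interface): bundle the image and the instruction.
    return Context(instruction=instruction, image=image)


def forecast_affordance(ctx: Context) -> Affordance:
    # Stage 2, toy stand-in: take the last word as the target object and
    # normalise the image so that it sums to 1, like a heat map.
    target = ctx.instruction.split()[-1]
    total = sum(sum(row) for row in ctx.image) or 1.0
    heatmap = [[value / total for value in row] for row in ctx.image]
    return Affordance(target, heatmap, (0.0, 0.0, 0.0))


def plan_action(ctx: Context, aff: Affordance) -> Dict[str, object]:
    # Stage 3, toy stand-in: aim at the heat map peak. This stage only
    # reads the affordance; it returns nothing to the earlier stages.
    _, row, col = max(
        (value, r, c)
        for r, line in enumerate(aff.contact_heatmap)
        for c, value in enumerate(line)
    )
    return {"target": aff.target_object, "contact_pixel": (row, col)}


def run_pipeline(image: List[List[float]], instruction: str) -> Dict[str, object]:
    # Strict sequence: understanding, then affordance, then action.
    ctx = understand(image, instruction)
    aff = forecast_affordance(ctx)
    return plan_action(ctx, aff)


result = run_pipeline([[0.0, 1.0], [3.0, 0.0]], "pick up the red cup")
print(result)
```

The frozen dataclasses are a design choice for this sketch only: reassigning a field on Context or Affordance raises an error, which mirrors the idea that later stages consume earlier outputs without rewriting them.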
Verify against the repo before relying on details.