Analysis updated 2026-05-18
Run the simulation flywheel to watch a VLA model automatically acquire a new robot movement primitive from scratch using the LIBERO benchmark.
Adapt the framework to a new robotic arm or task by defining your own movement primitives and connecting a Gemini-backed planner to guide acquisition.
Study how well a vision-language-action model generalizes to tasks outside its training distribution by tracking which primitives it needs to acquire.
| | insight-vla/insight | 18597990650-lab/multi-agent-game | agents365-ai/cloakfetch |
|---|---|---|---|
| Stars | 24 | 24 | 24 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 5/5 | 3/5 | 3/5 |
| Audience | researcher | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires Python 3.11, Google Cloud credentials for Gemini, a running policy server, and optionally xArm 6 hardware for real-robot experiments.
InSight is a research framework from Stanford University and Princeton University that explores how a robot can teach itself new physical skills without a human having to demonstrate them. The core idea is to start with a robot that already knows a set of basic movement primitives, such as "move gripper to bowl", "lift upward", or "pour the bottle". Given an unfamiliar task, the system automatically figures out which new primitive it needs, practices that primitive on its own, verifies whether the practice worked, and updates its model with the new skill.
The system is built around a type of AI model called a Vision-Language-Action model, or VLA, which takes visual input from cameras, processes natural-language instructions, and outputs physical movements for a robot arm to perform. InSight makes this model controllable at the level of individual movement primitives rather than just high-level task descriptions. A separate vision-language model acts as planner and verifier: it breaks a new task into a sequence of primitives, identifies which ones are missing from the robot's current repertoire, proposes how to execute those missing steps using a scripted controller, and checks before-and-after images to decide whether the result was acceptable.
The training process has two stages. In the first, the framework processes existing human demonstrations, breaks them into primitive-labeled segments automatically, and uses that labeled data to fine-tune a base VLA model. In the second stage, whenever the robot encounters a task with an unfamiliar primitive, the system collects new data and retrains the model to add that skill, requiring no additional human demonstrations of the new action.
The code supports both simulation experiments using the LIBERO robotics benchmark and real hardware experiments using an xArm 6 robot arm.
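The plan, practice, verify, and update loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not InSight's actual code: `PrimitiveLibrary`, `acquire`, `practice`, and `verify` are hypothetical names, the planner's output is assumed to be a plain list of primitive strings, and adding a string to a set stands in for retraining the VLA.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class PrimitiveLibrary:
    """Primitives the policy can already execute (hypothetical container)."""
    known: set[str] = field(default_factory=set)

    def missing(self, plan: list[str]) -> list[str]:
        """Return primitives in the plan that are not yet known, in plan order."""
        seen: set[str] = set()
        result: list[str] = []
        for primitive in plan:
            if primitive not in self.known and primitive not in seen:
                seen.add(primitive)
                result.append(primitive)
        return result


def acquire(
    plan: list[str],
    library: PrimitiveLibrary,
    practice: Callable[[str], Any],
    verify: Callable[[str, Any], bool],
    max_attempts: int = 3,
) -> list[str]:
    """Practice each missing primitive until the verifier accepts a rollout."""
    acquired: list[str] = []
    for primitive in library.missing(plan):
        for _ in range(max_attempts):
            rollout = practice(primitive)       # e.g. a scripted-controller rollout
            if verify(primitive, rollout):      # e.g. a before/after image check
                library.known.add(primitive)    # placeholder for retraining the model
                acquired.append(primitive)
                break
    return acquired


# Stub planner output and stub practice/verify callables, for illustration only.
library = PrimitiveLibrary(known={"move gripper to bowl", "lift upward"})
plan = ["move gripper to bowl", "lift upward", "pour the bottle"]
acquired = acquire(
    plan,
    library,
    practice=lambda p: {"primitive": p},
    verify=lambda p, rollout: True,
)
print(acquired)  # ['pour the bottle']
```

In a real setup the two callables would wrap the scripted controller and the vision-language verifier, and the set update would be replaced by data collection followed by a retraining run.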
Installation requires Python 3.11 and the uv package manager, along with API access to Google's Gemini model for the vision-language planning and verification steps. Pre-trained checkpoints and datasets are documented separately. The project is released under the Apache 2.0 license.
A robotics research framework from Stanford and Princeton that lets a robot identify which movement skills it is missing for a new task and learn them automatically, without any human demonstrations of the new skill.
Mainly Python. The stack also includes JAX and Flax.
Apache 2.0: use freely for any purpose, including commercial use, as long as you keep the copyright and license notices.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.