explaingit

insight-vla/insight

Analysis updated 2026-05-18

Stars · 24 | Language · Python | Audience · researcher | Complexity · 5/5 | License · Apache 2.0 | Setup · hard

TLDR

A robotics research framework from Stanford and Princeton that lets a robot identify which movement skills it is missing for a new task and learn them automatically, without any human demonstrations of the new skill.

Mindmap

mindmap
  root((InSight))
    What it does
      Self-guided skill acquisition
      Primitive gap identification
      VLM-verified rollout collection
      VLA retraining loop
    Tech Stack
      Python
      JAX and Flax
      Gemini VLM API
      LIBERO simulation
      xArm hardware
    Use Cases
      Robotics research
      VLA fine-tuning
      Skill generalization study
    Setup
      Python 3.11 and uv
      Google Cloud credentials
      Policy server required

What do people build with it?

USE CASE 1

Run the simulation flywheel to watch a VLA model automatically acquire a new robot movement primitive from scratch using the LIBERO benchmark.

USE CASE 2

Adapt the framework to a new robotic arm or task by defining your own movement primitives and connecting a Gemini-backed planner to guide acquisition.

USE CASE 3

Study how well a vision-language-action model generalizes to tasks outside its training distribution by tracking which primitives it needs to acquire.
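As a rough illustration of the primitive tracking described in use case 3, the sketch below compares planned primitive sequences against a known repertoire and reports which tasks need which missing primitive. The function names, the task names, and the "flip block" primitive are hypothetical and are not taken from the InSight codebase; only the three repertoire entries come from the description on this page.

```python
from collections import defaultdict


def missing_primitives(plan: list[str], repertoire: set[str]) -> list[str]:
    """Return primitives in `plan` that are not in `repertoire`.

    Order of first appearance is preserved and duplicates are dropped.
    """
    gaps: list[str] = []
    for step in plan:
        if step not in repertoire and step not in gaps:
            gaps.append(step)
    return gaps


def gap_report(task_plans: dict[str, list[str]], repertoire: set[str]) -> dict[str, list[str]]:
    """Map each missing primitive to the list of tasks that require it."""
    report: dict[str, list[str]] = defaultdict(list)
    for task, plan in task_plans.items():
        for primitive in missing_primitives(plan, repertoire):
            report[primitive].append(task)
    return dict(report)


# Repertoire entries are the examples quoted on this page.
repertoire = {"move gripper to bowl", "lift upward", "pour the bottle"}

# Hypothetical plans, standing in for what a planner might return.
task_plans = {
    "pour water into bowl": ["move gripper to bowl", "pour the bottle"],
    "flip the block": ["lift upward", "flip block", "lift upward"],
    "flip then pour": ["flip block", "pour the bottle"],
}

report = gap_report(task_plans, repertoire)
print(report)
```

A report like this gives one simple way to see which out-of-distribution tasks share the same missing skill, under the assumption that plans are available as plain lists of primitive names.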

What is it built with?

Python, JAX, Flax, Gemini API, LIBERO, xArm SDK

How does it compare?

Repo               insight-vla/insight   18597990650-lab/multi-agent-game   agents365-ai/cloakfetch
Stars              24                    24                                 24
Language           Python                Python                             Python
Setup difficulty   hard                  moderate                           moderate
Complexity         5/5                   3/5                                3/5
Audience           researcher            developer                          developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Requires Python 3.11, Google Cloud credentials for Gemini, a running policy server, and optionally xArm 6 hardware for real-robot experiments.
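A minimal preflight sketch for the requirements listed above, assuming a few details this page does not specify: that credentials are supplied through the standard Google Cloud `GOOGLE_APPLICATION_CREDENTIALS` variable, that the uv package manager is on the PATH, and that the policy server listens on a TCP host and port you pass in. The `preflight` function is hypothetical and is not part of the repository.

```python
import os
import shutil
import socket
import sys


def preflight(env=None, version=None, policy_addr=None, timeout=1.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    problems: list[str] = []

    # The page states Python 3.11 is required.
    if version != (3, 11):
        problems.append(f"Python 3.11 expected, found {version[0]}.{version[1]}")

    # uv is named as the package manager; shutil.which returns None if it is absent.
    if shutil.which("uv") is None:
        problems.append("uv not found on PATH")

    # Assumption: credentials are provided via the usual Google Cloud variable.
    if not env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")

    # Assumption: the policy server is reachable over TCP at (host, port).
    if policy_addr is not None:
        try:
            with socket.create_connection(policy_addr, timeout=timeout):
                pass
        except OSError:
            problems.append(f"policy server not reachable at {policy_addr[0]}:{policy_addr[1]}")

    return problems


for problem in preflight():
    print("preflight:", problem)
```

Passing `policy_addr=("localhost", 8000)` would add the server check; the host and port here are placeholders, so consult the repository for the real values.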

Apache 2.0: use freely for any purpose, including commercial use, as long as you keep the copyright and license notices.

In plain English

InSight is a research framework from Stanford University and Princeton University that explores how a robot can teach itself new physical skills without a human demonstrating them. The robot starts with a set of basic movement primitives, such as "move gripper to bowl", "lift upward", or "pour the bottle". Given an unfamiliar task, the system works out which new primitive it needs, practices that primitive on its own, verifies whether the practice worked, and updates its model with the new skill.

The system is built around a type of AI model called a Vision-Language-Action model, or VLA, which takes visual input from cameras, processes natural-language instructions, and outputs physical movements for a robot arm to perform. InSight makes this model controllable at the level of individual movement primitives rather than only high-level task descriptions.

A separate vision-language model acts as planner and verifier. It breaks a new task into a sequence of primitives, identifies which ones are missing from the robot's current repertoire, proposes how to execute those missing steps using a scripted controller, and checks before-and-after images to decide whether the result was acceptable.

The training process has two stages. In the first, the framework processes existing human demonstrations, automatically breaks them into primitive-labeled segments, and uses that labeled data to fine-tune a base VLA model. In the second, whenever the robot encounters a task with an unfamiliar primitive, the system collects new data and retrains the model to add that skill, with no additional human demonstrations of the new action.

The code supports both simulation experiments using the LIBERO robotics benchmark and real hardware experiments using an xArm 6 robot arm.

Installation requires Python 3.11 and the uv package manager, along with API access to Google's Gemini model for the vision-language planning and verification steps. Pre-trained checkpoints and datasets are documented separately. The project is released under the Apache 2.0 license.
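The second-stage loop described in this section (plan, find missing primitives, practice with a scripted controller, verify, retrain) can be sketched roughly as follows. Every name here is hypothetical: the planner, controller, verifier, and retraining step are plain callables standing in for the Gemini-backed VLM and the VLA fine-tuning code, and the rule "retrain if at least one rollout is accepted" is an assumption made for illustration, not the project's documented policy.

```python
import itertools
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    """One practice attempt; strings stand in for before/after camera images."""
    primitive: str
    before: str
    after: str


@dataclass
class Flywheel:
    repertoire: set[str]
    plan: Callable[[str], list[str]]           # stand-in for the VLM planner
    run_scripted: Callable[[str], Rollout]     # stand-in for the scripted controller
    verify: Callable[[Rollout], bool]          # stand-in for the VLM verifier
    retrain: Callable[[list[Rollout]], None]   # stand-in for VLA fine-tuning
    attempts_per_primitive: int = 5

    def acquire(self, task: str) -> list[str]:
        """Practice and add every primitive the task needs but the robot lacks."""
        learned: list[str] = []
        # dict.fromkeys drops repeated primitives while keeping plan order.
        for primitive in dict.fromkeys(self.plan(task)):
            if primitive in self.repertoire:
                continue
            rollouts = [self.run_scripted(primitive) for _ in range(self.attempts_per_primitive)]
            accepted = [r for r in rollouts if self.verify(r)]
            if accepted:  # assumption: any verified rollout is enough to retrain
                self.retrain(accepted)
                self.repertoire.add(primitive)
                learned.append(primitive)
        return learned


# Toy stubs: every second scripted attempt "succeeds".
counter = itertools.count()


def fake_controller(primitive: str) -> Rollout:
    ok = next(counter) % 2 == 0
    return Rollout(primitive, before="start", after="done" if ok else "failed")


trained: list[Rollout] = []
flywheel = Flywheel(
    repertoire={"move gripper to bowl", "lift upward"},
    plan=lambda task: ["move gripper to bowl", "pour the bottle", "lift upward"],
    run_scripted=fake_controller,
    verify=lambda rollout: rollout.after == "done",
    retrain=trained.extend,
    attempts_per_primitive=4,
)
learned = flywheel.acquire("pour the bottle into the bowl")
print(learned, len(trained))
```

Keeping the four roles as injected callables makes the control flow easy to test with stubs like these before connecting a real planner, verifier, or training job.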

Copy-paste prompts

Prompt 1
Explain how InSight's two-stage pipeline works: what happens in the steerable VLA training stage and what happens in the VLM-guided flywheel stage?
Prompt 2
I want to run InSight's LIBERO block-flip simulation experiment. What do I need to install, what API keys do I need, and what commands do I run?
Prompt 3
How does InSight use a VLM as a planner and a separate VLM as an oracle? What inputs does each one receive and what decisions does each one make?
Prompt 4
What is a Vision-Language-Action model and how does InSight make one steerable at the level of individual movement primitives rather than full task descriptions?

Frequently asked questions

What is insight?

A robotics research framework from Stanford and Princeton that lets a robot identify which movement skills it is missing for a new task and learn them automatically, without any human demonstrations of the new skill.

What language is insight written in?

Mainly Python. The stack also includes JAX and Flax.

What license does insight use?

Apache 2.0: use freely for any purpose, including commercial use, as long as you keep the copyright and license notices.

How hard is insight to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is insight for?

Mainly researchers.

Verify against the repo before relying on details.