Analysis updated 2026-05-18
Improve a behavior-cloned robot manipulation policy without collecting new human demonstrations, using self-supervised autonomous practice.
Train a residual correction network that patches a frozen pi-zero policy using dense per-step rewards from the SARM2 reward model.
| | qianzhong-chen/openspiral | andyuneducated/resolve-ai | carriex6/cvpr2026_similarity_as_evidence |
|---|---|---|---|
| Stars | 18 | 18 | 18 |
| Language | Python | Python | Python |
| Setup difficulty | hard | hard | hard |
| Complexity | 5/5 | 4/5 | 4/5 |
| Audience | researcher | developer | researcher |
Figures from each repo's GitHub metadata at analysis time.
Requires 24+ GB GPU memory for training and a real robot with SARM2-labeled demonstration data.
OpenSpiral is a research code package that implements SPIRAL, a method that lets a robot manipulation policy improve through autonomous practice instead of new human demonstrations. It accompanies a research paper on self-improving robotic manipulation and builds on Physical Intelligence's open-source pi-zero vision-language-action model.
The motivation is cost: training robot policies from human demonstrations requires many high-quality examples. SPIRAL takes an existing behavior-cloned policy and adds a small secondary network, called a residual policy, that is trained with reinforcement learning. The large base model stays frozen, so gradients update only the small residual network rather than backpropagating through the entire model. The robot collects data by practicing autonomously, and a companion reward model called SARM2 assigns a progress score to each step of each attempt. Those scores serve as the reward signal for the reinforcement learning update.
At each time step, the frozen base policy produces an action and the residual network produces a small correction; their sum is the command sent to the robot. A critic ensemble of five neural networks evaluates how good each action is, combining short-term step rewards with long-term episode returns so the system can handle multi-step tasks that take several minutes to complete.
The pipeline has four stages: fine-tune the base pi-zero policy on demonstrations, use SARM2 to assign dense rewards to the data, run the residual reinforcement learning update, and collect new robot rollouts to repeat the cycle. The residual learning step requires at least 24 GB of GPU memory, and training relies on a real robot rather than a simulator. Installation uses the uv Python package manager. The README does not state a license.
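
The per-step mechanics described above can be illustrated with a minimal pure-Python sketch. It is not taken from the OpenSpiral code: the function names (`compose_action`, `discounted_returns`, `ensemble_value`), the clipping bound `max_correction`, the discount factor `gamma`, and the use of a plain mean over the critics are all assumptions made for illustration. The repository may combine critics and rewards differently.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical critic signature: (observation, action) -> scalar value estimate.
Critic = Callable[[Sequence[float], Sequence[float]], float]


def compose_action(base_action: Sequence[float],
                   residual: Sequence[float],
                   max_correction: float = 0.1) -> Vector:
    """Return base action plus a bounded residual correction."""
    if len(base_action) != len(residual):
        raise ValueError("base action and residual must have the same length")
    # Clipping is an assumption of this sketch; it keeps the correction small.
    clipped = [max(-max_correction, min(max_correction, r)) for r in residual]
    return [b + c for b, c in zip(base_action, clipped)]


def discounted_returns(step_rewards: Sequence[float],
                       gamma: float = 0.99) -> Vector:
    """Convert dense per-step rewards into a return-to-go for every step."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    # Walk backwards so each step accumulates all discounted future reward.
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


def ensemble_value(critics: Sequence[Critic],
                   observation: Sequence[float],
                   action: Sequence[float]) -> float:
    """Average the critics' estimates (the mean is an assumed aggregation)."""
    estimates = [critic(observation, action) for critic in critics]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    base = [0.20, -0.10, 0.05]        # stand-in for the frozen policy's output
    residual = [0.02, 0.50, -0.01]    # stand-in for the residual network's output
    print(compose_action(base, residual))
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

In this sketch the second residual component exceeds the bound and is clipped before it is added, which shows how a residual policy can be kept from overriding the base policy.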
Research code that trains a small residual network on top of a frozen pi-zero robot policy so the robot can improve itself through autonomous practice using dense step-by-step rewards.
Mainly Python. The stack also includes JAX and PyTorch.
License terms are not stated in the README; check the repository's LICENSE file.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.