Run zero-shot pick-and-place on an SO-101 arm from a natural-language instruction.
Sanity-check predicted joint targets with the dry-run mode before powering the motors.
Reproduce a robot-arm demo of MolmoAct2 on a single RTX 3080 laptop.
Calibrate Feetech servo motors and store the per-joint zero positions in the repo configs.
Requires an SO-101 arm, a RealSense D455 on USB 3, a wrist webcam, and a recent NVIDIA GPU. LeRobot must be pinned to exactly 0.5.1, or the arm can slam into the table on startup.
This repository is a working example of controlling a small robot arm, the SO-101, with natural-language instructions and without any training or demonstration data collected by the user. It runs MolmoAct2, a vision-language-action model recently released by AI2 (the Allen Institute for AI). You type an instruction such as "pick up the lemon and drop it in the red bowl", and the model takes camera images plus the current arm position and outputs the next sequence of joint movements at 30 Hz.

The hardware list is short. You need an SO-101 follower arm, which is an open-source robot arm design; an Intel RealSense D455 depth camera mounted to view the workspace from the side (only its color image is used); and a regular USB webcam attached to the wrist of the arm. The README notes that the RealSense camera needs a USB 3 data cable and port to run at full speed.

Setup uses Python 3.12 in a conda environment. A single pip install -r requirements.txt pulls in PyTorch with CUDA, version 0.5.1 of the LeRobot library, the Feetech servo driver for the arm motors, and the HuggingFace tools. The MolmoAct2 model weights, around 15 GB, are downloaded automatically the first time you run inference. The README is very firm that the LeRobot version must be exactly 0.5.1: the joint-angle conventions changed between versions, and a mismatch can cause the arm to slam into the table on startup.

Before running the model on the arm, you calibrate each motor's zero position with lerobot-calibrate and copy the result into the repo's configs folder. The main script is inference.py, which takes a follower port, a wrist camera ID, and a prompt. A dry-run mode prints the predicted joint targets without moving the arm, so you can sanity-check the setup first. A GPU is required, and a laptop with an RTX 3080 or better is recommended.
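Because an off-pin LeRobot install is the failure the README warns about most strongly, it may help to check the installed version before any motor is powered. The following is a minimal sketch of such a guard, not code from the repository. It assumes the package is distributed under the name "lerobot"; the function names pin_matches and check_lerobot_pin are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_LEROBOT = "0.5.1"  # exact pin stated in the README


def pin_matches(installed: str, required: str = REQUIRED_LEROBOT) -> bool:
    """Return True only when the installed version string equals the pin exactly."""
    return installed.strip() == required


def check_lerobot_pin() -> None:
    """Raise before any motor command if lerobot is missing or off-pin.

    Assumption: the distribution name is "lerobot".
    """
    try:
        installed = version("lerobot")
    except PackageNotFoundError as exc:
        raise RuntimeError("lerobot is not installed") from exc
    if not pin_matches(installed):
        raise RuntimeError(
            f"lerobot {installed} found, but {REQUIRED_LEROBOT} is required"
        )
```

Calling check_lerobot_pin() at the top of a launch script, ahead of any code that opens the follower port, would turn a version mismatch into an immediate error instead of an unexpected arm motion. The comparison is a deliberate exact string match rather than a range check, since the README asks for one specific version.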
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.