Run the released HEX 2.4B checkpoint in eval_model.ipynb to test humanoid policies on your own data
Fine-tune the VLA model on a new humanoid platform using the cross-embodiment slot scheme
Pretrain a custom whole-body manipulation policy on the AgiBot World plus Humanoid Everyday mixture
Reproduce paper results on Unitree G1 or Tienkung robots
Requires a CUDA GPU, FlashAttention 2 wheels matched to your card, EGL/Mesa system libraries, and Hugging Face downloads of both the HEX and Qwen3-VL checkpoints.
HEX is research code from the Open-X-Humanoid project that accompanies the paper "Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation". In plain terms, it is a control system for full-sized humanoid robots that takes camera input plus a language instruction and decides how the robot should move. The README describes it as a vision-language-action (VLA) framework, with a 2.4-billion-parameter model released on Hugging Face under the name HEX-model.
The model is built from three parts. A Qwen-VL backbone, a pretrained vision and language model, reads images and text. A unified proprioceptive predictor takes the robot's own joint and sensor readings and aligns them across different robot bodies. A flow-matching action head outputs the next stretch of continuous arm, hand, and waist motions. A separate reinforcement learning controller handles the legs and follows high-level commands from the main policy, which is meant to keep the robot stable while it manipulates objects.
A key claim is cross-embodiment training. The team aligns data from several humanoid platforms, including the Tienkung series, Unitree G1, Unitree H1, and Leju Kuavo, into shared body-part slots so that the policy learns one set of dynamics that transfers across the different machines. The training mixture draws on the team's own released dataset and on public sets such as Humanoid Everyday, AgiBot World Colosseo with the TrajBooster retargeting, and RoboCOIN, with links to each on Hugging Face.
The install path is conda-based. You clone the repo, create a Python 3.10 environment, apt install some EGL and Mesa system libraries, pip install the requirements, install FlashAttention 2, and then pip install -e the package itself. The README includes a fallback recipe for newer GPUs such as the RTX 5090, where the prebuilt FlashAttention wheels may not match, and points readers to the official wheels page.
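Because the install depends on the Python version, the GPU, and a FlashAttention wheel that matches both, a small pre-flight check can save a failed run. The sketch below is not part of the HEX repo. It is a minimal illustration, and the function name and the report keys are names chosen here. It only uses the standard library plus, if present, PyTorch's `torch.cuda` calls, and it assumes FlashAttention is importable as `flash_attn`. The README's own install steps remain the reference.

```python
import importlib.util
import shutil
import sys


def preflight() -> dict:
    """Collect basic facts about the current environment.

    Hypothetical helper for illustration; it does not install or fix anything.
    """
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # The summary above says the environment is created with Python 3.10.
        "python_is_3_10": sys.version_info[:2] == (3, 10),
        # find_spec checks whether a package can be found without importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "cuda_available": False,
        "compute_capability": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            if torch.cuda.is_available():
                report["cuda_available"] = True
                # (major, minor) tuple, useful when choosing a FlashAttention wheel.
                report["compute_capability"] = torch.cuda.get_device_capability(0)
        except ImportError:
            # torch is present on disk but could not be imported (broken install).
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

If `flash_attn_installed` is false or the compute capability belongs to a very recent card, that is the situation the README's fallback recipe is meant to cover.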
To run inference, you download the HEX checkpoint and the Qwen3-VL base model from Hugging Face, point config.yaml at your local Qwen path, and open the Jupyter notebook eval_model.ipynb that the team ships in the notebooks folder. For pretraining and fine-tuning there are bash scripts under scripts/ where you set the base VLM path, the data root, and a dataset mixture name that has to match one of the entries listed in the dataloader files. The team notes that data collection code for the Tienkung robots cannot be released due to commercial restrictions, and points users who want to gather data on Unitree G1 to two outside open-source teleoperation projects, OpenTrajBooster and Psi0.
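The training scripts need three settings to agree with what is on disk and in the code: the base VLM path, the data root, and the mixture name. The sketch below shows one way to check them before launching a long job. It is an assumption-laden illustration, not code from the repo: the function name is made up, and `KNOWN_MIXTURES` holds placeholder names that stand in for whatever the dataloader files actually list.

```python
import difflib
from pathlib import Path

# Placeholder names only. In the real repo the valid mixture names are the
# entries defined in the dataloader files; copy them from there.
KNOWN_MIXTURES = {"example_mixture_a", "example_mixture_b"}


def check_launch_settings(base_vlm_path, data_root, mixture,
                          known_mixtures=KNOWN_MIXTURES):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    if not Path(base_vlm_path).is_dir():
        problems.append(f"base VLM path is not a directory: {base_vlm_path}")
    if not Path(data_root).is_dir():
        problems.append(f"data root is not a directory: {data_root}")
    if mixture not in known_mixtures:
        # Suggest the closest registered name to catch simple typos.
        hint = difflib.get_close_matches(mixture, sorted(known_mixtures), n=1)
        suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
        problems.append(f"unknown dataset mixture {mixture!r}{suffix}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths; replace with your local Qwen3-VL and dataset locations.
    for line in check_launch_settings("./Qwen3-VL", "./data", "example_mixture_a"):
        print("problem:", line)
```

Returning a list instead of raising on the first error lets you see every mismatch at once, which is convenient when editing several variables in a bash script.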
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.