Analysis updated 2026-05-18
Reproduce the ECCV 2026 HOPformer results on the ARCTIC or EPIC-Kitchens datasets
Train a model to estimate 3D hand and object pose from egocentric video
Use the EPIC-Contact dataset for 3D hand-object contact research
| | sid2697/hopformer | 0petru/sentimo | alingalingling/akasha-wechat |
|---|---|---|---|
| Stars | 17 | 17 | 17 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | hard |
| Complexity | 5/5 | 3/5 | 4/5 |
| Audience | researcher | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires CUDA GPU, conda, MANO model registration, WiLoR weights download, and a manual patch to the smplx package.
HOPformer is a Python research codebase published alongside an ECCV 2026 paper. It addresses a narrow problem: given a single RGB image from an egocentric camera (the kind mounted on a person's head or glasses), it estimates the 3D positions and shapes of both hands and any object being held or manipulated, all in one step.
The system is built on a transformer neural network. It uses the MANO hand-shape model to represent hand geometry and draws on WiLoR, a pre-trained hand analysis model, for strong prior knowledge about how hands look. By combining those two sources of information through cross-attention, the model can handle situations where hands and objects heavily overlap or block each other, which is common in first-person video.
Alongside the model, the authors release EPIC-Contact, a dataset of roughly 2,300 short video clips with labeled 3D hand and object contact information. The dataset was built with a separate fitting pipeline called EC-fit, which is also included.
The codebase supports training and evaluation on two datasets: ARCTIC (a lab-recorded collection of bimanual manipulation tasks) and EPIC-Kitchens (a large real-world dataset of first-person cooking and household activity). The results listed in the README show improvements over previous methods on both.
Setup requires Python 3.10, PyTorch 2.5.1, and a CUDA-capable GPU. Installation involves creating a conda environment, downloading several hand and body model files from third-party websites, downloading the WiLoR model weights, and manually patching a dependency so that it returns the correct number of hand joints. The datasets require registration credentials to download.
This repo is aimed at computer vision researchers who want to reproduce the paper's results or build on the HOPformer method. It is not a ready-to-use application for non-researchers.
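The cross-attention step mentioned in the description can be illustrated with a generic, dependency-free sketch of scaled dot-product attention. This is not HOPformer's architecture: the token roles (image tokens as queries, hand-prior tokens as keys and values), the toy vectors, and the function names are all assumptions made for illustration, and learned projections and multiple heads are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(key dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output dimension at a time.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical toy data: two image-derived query tokens attend over
# three hand-prior tokens (keys) that carry one-hot value vectors.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
prior_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused = cross_attention(image_tokens, prior_keys, prior_values)
```

Because the values here are one-hot, each row of `fused` is simply the attention weights for that query, which makes it easy to see which prior tokens a given image token draws on.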
The README is detailed and covers installation carefully, but the overall setup involves multiple manual steps and external dependencies.
HOPformer is a research model that predicts 3D hand and object positions from a single first-person camera image, released with code and a new labeled dataset of egocentric video clips.
Mainly Python. The stack also includes PyTorch and CUDA.
Setup difficulty is rated hard, with roughly 1 day+ to a first successful run.
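Since setup involves several manual steps, a small preflight check run before training may save time. The sketch below is not part of the repository. It assumes the versions stated in this summary (Python 3.10, PyTorch 2.5.1), checks that `conda`, `torch`, and `smplx` are available, and accepts an optional list of file paths. The function name `check_environment` and any paths passed to it are hypothetical; the real MANO and WiLoR file locations should be taken from the README.

```python
import importlib.util
import shutil
import sys
from pathlib import Path

# Versions taken from this summary; adjust them if the README says otherwise.
REQUIRED_PYTHON = (3, 10)
REQUIRED_TORCH = "2.5.1"


def check_environment(python_version=None, required_files=()):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    version = tuple(python_version or sys.version_info[:2])
    if version != REQUIRED_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, 3.10 expected")

    if shutil.which("conda") is None:
        problems.append("conda not found on PATH")

    for package in ("torch", "smplx"):
        if importlib.util.find_spec(package) is None:
            problems.append(f"{package} is not installed")

    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check still runs without PyTorch

        if not torch.__version__.startswith(REQUIRED_TORCH):
            problems.append(f"torch {torch.__version__} found, {REQUIRED_TORCH} expected")
        if not torch.cuda.is_available():
            problems.append("no CUDA-capable GPU visible to PyTorch")

    # required_files holds caller-supplied paths (e.g. model weights); none are assumed here.
    for path in required_files:
        if not Path(path).exists():
            problems.append(f"missing file: {path}")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("PROBLEM:", problem)
```

The check only reports what it can observe locally. It does not verify the manual smplx patch or dataset credentials, which still need to be confirmed by following the README.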
Mainly researchers.
Verify against the repo before relying on details.