Analysis updated 2026-05-18
Reproduce the CVPR 2026 TeHOR paper results by running the provided scripts on the example images.
Use the reconstruction pipeline as a baseline to compare against your own 3D human-object interaction method.
Generate textured 3D models of a person and object from a single photograph for non-commercial research or visualization.
| | hygenie1228/tehor_release | danieldoradotalaveron-rb/yolosegment-2d-to-3d-rebotarm_pick_and_place | ewreaslan/jwttx |
|---|---|---|---|
| Stars | 9 | 9 | 9 |
| Language | Python | Python | Python |
| Setup difficulty | hard | hard | easy |
| Complexity | 5/5 | 5/5 | 3/5 |
| Audience | researcher | researcher | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires a CUDA GPU, multiple large pretrained model downloads, and an OpenAI API key for text prompt generation.
This is the official code release for TeHOR, a computer vision research system published at CVPR 2026 by researchers at Seoul National University. The problem it solves is specific: given a single photo containing a person and an object, reconstruct both as detailed 3D models, complete with surface textures and a realistic physical relationship between them.
The key insight in the approach is that contact alone is not enough to understand how people relate to objects. Someone gazing at a book or pointing at a sign is meaningfully interacting with an object without touching it. TeHOR uses text descriptions to capture this broader sense of interaction, so the resulting 3D reconstruction reflects not just where bodies and objects are in space but also the semantic nature of what is happening between them.
The output is a pair of textured 3D models that fit together coherently: a human figure and an object, positioned and oriented relative to each other in a way that matches the original photo. The system handles appearance as well as shape, meaning the surfaces have colors and textures drawn from the input image rather than being blank geometry.
Running TeHOR requires significant infrastructure. Setup involves Python 3.10, a specific version of PyTorch, a CUDA-capable GPU, and a large number of third-party dependencies installed through multiple setup scripts. The data directory expects several large pretrained models downloaded separately, and the pipeline also calls the OpenAI API for text prompt generation, requiring an API key. The workflow is three steps: preprocess the input image to extract the human and object separately, optionally align an object mesh to a depth estimate, then run the main optimization script.
This is a research codebase, not a consumer tool. It is aimed at computer vision researchers or advanced practitioners who want to reproduce the paper's results or build on the method.
The license is CC BY-NC 4.0, which allows use and adaptation for non-commercial purposes only.
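The prerequisites described above (Python 3.10, a CUDA-capable GPU visible to PyTorch, and an OpenAI API key) can be checked before starting the long setup. The following is a minimal sketch, not part of the repository: the function name `preflight` is hypothetical, and it assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which is the OpenAI SDK's default but has not been verified against this repo. It does not check the pretrained model downloads, since their paths are not given here.

```python
import importlib.util
import os
import sys


def preflight(env=None, python_version=None):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only; it is not shipped with TeHOR.
    """
    env = os.environ if env is None else env
    if python_version is None:
        python_version = sys.version_info[:2]
    major, minor = tuple(python_version)[:2]
    problems = []

    # The summary names Python 3.10 as the expected interpreter.
    if (major, minor) != (3, 10):
        problems.append(f"Python 3.10 expected, found {major}.{minor}")

    # Assumption: the pipeline reads the key from OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Only import torch if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        if not torch.cuda.is_available():
            problems.append("PyTorch cannot see a CUDA device")

    return problems
```

Calling `preflight()` with no arguments inspects the current interpreter and environment; printing the returned list gives a quick picture of what still needs to be installed or configured before running the repo's own setup scripts.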
A research system that reconstructs a textured 3D human figure and object from a single photo, using text descriptions to capture how they interact.
Mainly Python. The stack also includes PyTorch and CUDA.
CC BY-NC 4.0: free to use and modify for non-commercial purposes, with attribution required.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.