ju-suk/dynaflip

★ 21 | Python | Audience · researcher | Complexity · 4/5 | License · Apache 2.0 | Setup · hard

Mindmap

mindmap
  root((DynaFLIP))
    What it does
      Aligns image language motion
      Shared embedding space
      Robot instruction following
    Three Encoders
      Image encoder DINOv2
      Language encoder T5
      3D motion encoder
    Usage
      Load from Hugging Face
      Embed images or text
      Feed into robot control
    Training
      Multi-GPU script
      Robot demo dataset
      Convert to HF format


Things people build with this

USE CASE 1

Load the pretrained model from Hugging Face to get numerical embeddings of images or instructions for a robot control system.

USE CASE 2

Train a custom version of DynaFLIP on your own robot demonstration dataset with matching images, language, and 3D motion recordings.

USE CASE 3

Use DynaFLIP embeddings to build a robot that can follow natural language instructions like 'open the drawer' based on what it sees.
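One way to picture how a shared embedding space supports instruction following is to compare an image embedding against several instruction embeddings and pick the closest one. The following is a minimal sketch, assuming embeddings are plain lists of floats. The toy vectors and the helper names `cosine_similarity` and `best_instruction` are hypothetical and are not part of DynaFLIP's API; real embeddings would come from the pretrained encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_instruction(image_embedding, instruction_embeddings):
    """Return the instruction text whose embedding is closest to the image embedding."""
    return max(
        instruction_embeddings,
        key=lambda text: cosine_similarity(image_embedding, instruction_embeddings[text]),
    )


# Toy 3-dimensional vectors standing in for real model outputs (made up for illustration).
image_embedding = [0.9, 0.1, 0.0]
instruction_embeddings = {
    "open the drawer": [1.0, 0.0, 0.0],
    "close the fridge": [0.0, 1.0, 0.0],
}

print(best_instruction(image_embedding, instruction_embeddings))
```

A real control system would likely feed the embeddings into a learned policy rather than a nearest-neighbor lookup; the lookup only illustrates that vectors from different modalities become directly comparable.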

Tech stack

Python · PyTorch · PyTorch Lightning · DINOv2 · T5 · Hugging Face

Getting it running

Difficulty · hard | Time to first run · 30 min

Training requires a multi-GPU machine and a robot demonstration dataset with matched images, language annotations, and 3D motion data.
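Because every training sample needs all three modalities, it can help to check a dataset for gaps before starting a long multi-GPU run. The sketch below assumes a hypothetical record layout (`DemoSample` with an image path, an instruction string, and a list of 3D points). This layout is an illustration only; the actual format expected by the training script should be taken from the repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DemoSample:
    """One demonstration record (hypothetical layout, not DynaFLIP's actual format)."""
    image_path: str
    instruction: str
    motion: List[Tuple[float, float, float]]  # 3D points along a trajectory


def find_incomplete(samples):
    """Return indices of samples missing an image, an instruction, or motion data."""
    bad = []
    for i, s in enumerate(samples):
        if not s.image_path or not s.instruction.strip() or not s.motion:
            bad.append(i)
    return bad


# Made-up records for illustration.
samples = [
    DemoSample("frames/0001.png", "open the drawer", [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)]),
    DemoSample("frames/0002.png", "", [(0.1, 0.0, 0.05)]),  # missing instruction
    DemoSample("frames/0003.png", "close the fridge", []),  # missing motion
]

print(find_incomplete(samples))
```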

Apache 2.0 license: use freely for any purpose, including commercial use, and modify and distribute as you like, but keep the license and copyright notice.

In plain English

DynaFLIP is a research project that produces a pretrained AI model designed to help robots better understand their surroundings. It was published alongside a research paper and is aimed at researchers and engineers working on robot learning, not general-purpose developers.

The core idea is that robots need to connect three different kinds of information at once: what they see in images, what they are told in natural language instructions, and how objects are moving in 3D space. DynaFLIP trains a model that learns to align these three types of information in a shared mathematical space, so that an image of a scene, the phrase "close the fridge," and a recording of how a hand moved all end up represented in a way that the model can compare and relate to each other. The model is built from three separate encoders (one for images, one for language, and one for 3D motion trajectories) that are trained together.

For someone who just wants to use the model rather than train it, the pretrained version is available on Hugging Face and can be loaded with two standard Python packages. Once loaded, you can pass an image to get a numerical description of what is in it, or pass text to get a numerical description of the instruction. These representations can then feed into downstream robot control systems.

For researchers who want to train their own version, the repository includes the full training code, a configuration file, and a script for multi-GPU training. Training requires a dataset of robot demonstrations with matching images, language annotations, and 3D motion data, with the dataset paths configured before the training run starts. A separate script converts a trained checkpoint into the Hugging Face format for distribution.

The project is built on top of several established research tools, including DINOv2 for image encoding, T5 for language encoding, and PyTorch Lightning for the training framework. It is licensed under the Apache 2.0 license.
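The idea of aligning three modalities in one space can be made concrete with a small example. This summary does not state DynaFLIP's training objective, so the sketch below swaps in a common choice for alignment, a contrastive (InfoNCE-style) loss applied to each pair of modalities. The function names, the temperature value, and the toy vectors are all assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] better than any other positive."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Scaled cosine similarity between anchor i and every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def three_way_loss(image, text, motion):
    """Sum the pairwise losses so all three modalities are pulled toward each other."""
    return info_nce(image, text) + info_nce(image, motion) + info_nce(text, motion)


# Toy batch of two samples; row i of each list belongs to the same demonstration.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

print(three_way_loss(aligned, aligned, aligned) < three_way_loss(aligned, shuffled, aligned))
```

The loss is small when matching rows point in the same direction and large when the text rows are paired with the wrong image and motion rows, which is the behavior an alignment objective is meant to reward.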

Copy-paste prompts

Prompt 1

I want to use DynaFLIP to embed robot camera images and language instructions into a shared space. Show me how to load the Hugging Face checkpoint and run inference on a sample image and instruction string.

Prompt 2

Walk me through setting up the DynaFLIP training script for a custom robot demonstration dataset. What data format does it expect for images, language annotations, and 3D motion trajectories?

Prompt 3

How do I convert a trained DynaFLIP checkpoint into Hugging Face format for distribution after my training run finishes?


Verify against the repo before relying on details.