nvidia/vid2vid

★ 8,714 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((vid2vid))
    What it does
      Semantic to video
      Photo-realistic output
      Temporal consistency
    Inputs
      Color region maps
      Edge outlines
      Pose skeletons
    Tech
      Python
      PyTorch
      CUDA
    Requirements
      Linux or macOS
      NVIDIA GPU
      8 GPUs for full res


Things people build with this

USE CASE 1

Convert a semantic segmentation map of a city street into a photorealistic driving video using the pretrained model.

USE CASE 2

Generate a realistic talking-head video from a simple edge outline of a face.

USE CASE 3

Synthesize a moving person in video from skeleton pose keypoints for animation research.

Tech stack

Python, PyTorch, CUDA, Linux

Getting it running

Difficulty · hard | Time to first run · 1 day+

Training at the full 2048x1024 resolution requires 8 NVIDIA GPUs with at least 24 GB of memory each. Pretrained models are available for quick inference tests.
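Training is described on this page as proceeding in stages of increasing resolution up to the full 2048x1024 output. The sketch below shows what such a schedule could look like. It assumes three stages that each double the resolution, which puts the first stage at 512x256 (the size mentioned in one of the prompts further down). The function name and the stage count are illustrative, not taken from the repository.

```python
def resolution_schedule(full_width=2048, full_height=1024, num_stages=3):
    """Return (width, height) for each stage, coarsest first.

    Assumption: each stage doubles the resolution of the previous one.
    The real stage count and sizes should be checked against the repo README.
    """
    stages = []
    for stage in range(num_stages):
        # Downscale factor shrinks from 2**(num_stages - 1) to 1.
        factor = 2 ** (num_stages - 1 - stage)
        stages.append((full_width // factor, full_height // factor))
    return stages


schedule = resolution_schedule()
for index, (width, height) in enumerate(schedule, start=1):
    print(f"stage {index}: {width}x{height}")
```

Under these assumptions only the last stage needs the full multi-GPU setup. The earlier stages work on frames with 4x and 16x fewer pixels.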

In plain English

NVIDIA vid2vid is a research project that takes a video made of simplified visual inputs, such as colored region maps, edge outlines, or body pose skeletons, and generates a realistic-looking video that matches them. For example, you can take a map of a city street where each region is colored by category (road, sidewalk, building, sky) and produce a photorealistic video that looks like you are actually driving through that street. Other examples include generating a talking face from a simple edge outline of the face, or generating a moving person from a skeleton of their joints.

The project was published at NeurIPS 2018 by researchers from NVIDIA and MIT. It builds on earlier NVIDIA image translation work and focuses specifically on making the output consistent and smooth across video frames, not just realistic frame by frame.

To use it, you need Linux or macOS, Python 3, and an NVIDIA graphics card with CUDA support. Training at the highest resolution (2048 by 1024 pixels) requires 8 GPUs with at least 24 GB of memory each, so this is aimed at researchers and teams with significant hardware. Pre-trained models are available for the street and face examples, so you can test the system without training from scratch. Training works at increasing resolutions in stages, starting small and working up to the full output size.

The README includes detailed instructions for downloading datasets and pre-trained models, running tests, and training your own models on city street, face, and human pose data. This is a research code release tied to the published paper. It is not a finished product and is intended for academic exploration rather than production use.
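To make the "colored region map" input concrete, here is a minimal sketch of turning one color-coded frame into per-pixel class indices, the form a segmentation-conditioned model typically consumes. The palette colors, the class order, and the `color_map_to_labels` helper are all hypothetical illustrations. None of them are vid2vid's own data format or API.

```python
# Hypothetical palette: category name -> RGB color used in the region map.
# Both the colors and the class order are illustrative assumptions.
PALETTE = {
    "road": (128, 64, 128),
    "sidewalk": (244, 35, 232),
    "building": (70, 70, 70),
    "sky": (70, 130, 180),
}


def color_map_to_labels(color_map, palette):
    """Convert rows of RGB pixels into rows of integer class indices.

    Raises KeyError if a pixel color is not in the palette.
    """
    color_to_index = {color: i for i, color in enumerate(palette.values())}
    return [[color_to_index[tuple(pixel)] for pixel in row] for row in color_map]


# A tiny 2x2 frame: sky across the top, building and road along the bottom.
frame = [
    [PALETTE["sky"], PALETTE["sky"]],
    [PALETTE["building"], PALETTE["road"]],
]
labels = color_map_to_labels(frame, PALETTE)
print(labels)
```

A video input is then a sequence of such label frames, one per output frame.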

Copy-paste prompts

Prompt 1

How do I run vid2vid inference with the pretrained Cityscapes street model to convert a semantic label video into a photorealistic output?

Prompt 2

Walk me through downloading the face dataset and running vid2vid training at 512x256 resolution with a single GPU.

Prompt 3

How does vid2vid maintain temporal consistency across frames, and what loss functions prevent flickering between consecutive frames?

Prompt 4

What are the minimum GPU memory requirements for vid2vid inference vs full-resolution 2048x1024 training?

Prompt 5

How do I add my own dataset of paired segmentation maps and real videos to train a custom vid2vid model?

Verify against the repo before relying on details.