
fudan-generative-vision/hallo

8,645 Python Audience · researcher Complexity · 5/5 Setup · hard

TLDR

A research AI tool that takes a still portrait photo and an audio clip and generates a short video of that face speaking or singing in sync with the audio, running on a GPU with pretrained models from HuggingFace.

Mindmap

mindmap
  root((hallo))
    What it does
      Portrait animation
      Audio driven video
      Lip sync generation
    Inputs
      Still portrait photo
      Speech or audio clip
    Setup
      Ubuntu Linux
      NVIDIA GPU and CUDA
      HuggingFace weights
    Community Tools
      ComfyUI integration
      Docker image
      Web browser demo


Things people build with this

USE CASE 1

Animate a still portrait photo to speak or sing by providing an audio clip and running the inference script

USE CASE 2

Create talking-head videos for presentations or demos without recording real video footage

USE CASE 3

Try portrait animation in a browser using the hosted Hugging Face demo without installing anything locally

USE CASE 4

Train or fine-tune the model on custom image and audio data using the released training code

Tech stack

Python · PyTorch · CUDA · ffmpeg · HuggingFace

Getting it running

Difficulty · hard Time to first run · 1 day+

Requires an NVIDIA GPU (the README reports testing on A100s) with CUDA 12.1 on Ubuntu 20.04 or 22.04, plus large model weight downloads from HuggingFace.
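
The requirements above can be checked roughly before starting the longer setup. The following is a minimal sketch using only the Python standard library; the function name `preflight` is hypothetical and not part of the hallo repository. It only checks that the operating system is Linux and that the `nvidia-smi` and `ffmpeg` executables are on the PATH. It does not verify the CUDA version, the Ubuntu release, or available GPU memory.

```python
import platform
import shutil


def preflight() -> dict:
    """Report which prerequisites named in the summary appear to be present.

    This is a hypothetical helper, not part of hallo. Finding nvidia-smi on
    the PATH suggests an NVIDIA driver is installed, but says nothing about
    the CUDA version or the GPU model.
    """
    return {
        "linux": platform.system() == "Linux",
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing entry here is a hint about what to install first, not a definitive verdict; the README remains the reference for exact versions.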

In plain English

Hallo is a research project from Fudan University and collaborators that takes a still portrait photo and an audio clip, then generates a video of that person's face moving and speaking in sync with the audio. The name stands for Hierarchical Audio-Driven Visual Synthesis. You supply one image of a face and one recording of speech or singing, and the system produces a short animated video in which the portrait appears to talk or perform the audio naturally.

The project comes with pretrained model weights that you can download from HuggingFace. Once the weights are in place, you run a single Python inference script pointing at your image and audio files. It requires a Linux machine with a compatible NVIDIA GPU and CUDA installed. The README specifically lists Ubuntu 20.04 or 22.04 and CUDA 12.1, with testing done on A100 GPUs. Setup involves creating a Python environment, installing the listed packages, and installing ffmpeg for video processing.

Beyond basic inference, the team later released training code, so users with their own image and audio data can attempt to train or fine-tune models themselves. The community has built several wrappers around the core code, including a Windows port, a Docker image, a ComfyUI integration, and a web-based interface, all linked from the README. There is also a hosted Hugging Face demo where you can try the tool in a browser without installing anything locally.

The repository targets researchers and developers who want to experiment with audio-driven face animation. Non-technical users looking for a quick browser demo can use the Hugging Face space. Anyone who wants to run it locally will need some comfort with command-line setup, GPU hardware, and downloading large model files. The README walks through each step, including model download, data preparation, and the inference command.
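
The single-script inference step described above can be sketched as a small wrapper that assembles the command line. This is an illustration under stated assumptions: the script path `scripts/inference.py` and the flag names `--source_image` and `--driving_audio` are assumptions that should be checked against the README, and `build_inference_command` is a hypothetical helper. The file names `portrait.jpg` and `speech.wav` are placeholders.

```python
import shlex
import sys
from pathlib import Path

# Assumption: script location and flag names may differ; confirm in the README.
INFERENCE_SCRIPT = "scripts/inference.py"


def build_inference_command(image, audio, script=INFERENCE_SCRIPT):
    """Return the argument list for one inference run (hypothetical helper)."""
    return [
        sys.executable,
        script,
        "--source_image", str(Path(image)),
        "--driving_audio", str(Path(audio)),
    ]


def missing_inputs(image, audio):
    """List the input files that do not exist, so problems surface early."""
    return [str(p) for p in (Path(image), Path(audio)) if not p.is_file()]


if __name__ == "__main__":
    cmd = build_inference_command("portrait.jpg", "speech.wav")
    # Print rather than execute: running it needs the repo, weights, and a GPU.
    print(shlex.join(cmd))
```

Once the environment and weights are in place, the returned list could be passed to `subprocess.run(cmd, check=True)` from the repository root.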

Copy-paste prompts

Prompt 1
I have a portrait image portrait.jpg and an audio file speech.wav. Give me the full command to run hallo inference on Ubuntu 22.04 with CUDA to generate a talking-head video.
Prompt 2
Walk me through setting up the hallo environment on Ubuntu 22.04 with CUDA 12.1, including creating the Python environment, installing packages, and downloading pretrained weights from HuggingFace.
Prompt 3
How do I use the ComfyUI integration for hallo to run portrait animation as a node in a ComfyUI workflow?
Prompt 4
What are the hardware requirements to run hallo locally, and what is the minimum GPU VRAM needed for inference?
Prompt 5
How do I use the hallo Docker image to run portrait animation without manually setting up CUDA and Python dependencies?


Verify against the repo before relying on details.