tmelyralab/musetalk

★ 5,744 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((MuseTalk))
    What It Does
      Lip sync replacement
      Real time 30fps
      Face video editing
    Use Cases
      Video dubbing
      Virtual avatars
      Content creation
    Tech Stack
      Python PyTorch
      CUDA GPU
      MMLab FFmpeg
    Setup
      NVIDIA GPU required
      Python 3.10
      Pretrained weights
      HuggingFace demo


Things people build with this

USE CASE 1

Dub a video of a person speaking into a different language by swapping in new audio and regenerating lip movements to match.

USE CASE 2

Create a virtual human avatar that moves its lips in sync with synthesized speech for a chatbot or interactive demo.

USE CASE 3

Generate lip-synced video content for social media or presentations without hiring on-camera talent.

Tech stack

Python, PyTorch, FFmpeg, CUDA

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires an NVIDIA GPU, Python 3.10, PyTorch, MMLab packages, FFmpeg, and a separate download of pretrained model weights.
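Before starting the install, it can help to check which of these prerequisites are already present. The following is a minimal preflight sketch, not part of MuseTalk itself. It assumes the MMLab packages are importable as `mmcv`, `mmdet`, and `mmpose` (an assumption; check the README for the exact list), and it does not check for the pretrained weights, since their location is defined by the repo.

```python
import importlib.util
import shutil
import sys

# Assumption: these are the MMLab modules the install needs.
# The README is the authority on the exact package list.
MMLAB_MODULES = ("mmcv", "mmdet", "mmpose")


def has_module(name: str) -> bool:
    """Return True if `name` can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a CUDA device."""
    if not has_module("torch"):
        return False
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        # A broken PyTorch install counts as "not available" here.
        return False


def preflight() -> dict:
    """Collect pass/fail results for each prerequisite named above."""
    checks = {
        "python_3_10": sys.version_info[:2] == (3, 10),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": has_module("torch"),
        "cuda_available": cuda_available(),
    }
    for name in MMLAB_MODULES:
        checks[f"{name}_installed"] = has_module(name)
    return checks


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {check}")
```

Any line reported as `MISSING` points to a step in the README's installation walkthrough that still needs to be done.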

In plain English

MuseTalk is a Python tool that takes a video of a person's face and replaces the lip movements to match a new audio track. The result is a video in which the person appears to speak whatever audio you provide. On appropriate hardware, generation runs in real time at 30 frames per second or more.

The main practical use is dubbing: if you have a video of someone speaking in one language, you can generate new audio in another language and use MuseTalk to make the person's lips match that new audio. It also works for creating virtual human avatars that respond to spoken input. The project comes from Tencent Music Entertainment's research lab and is now on version 1.5, which the team says has noticeably better visual quality and more accurate lip-sync than the original.

Under the hood, the model works differently from most AI image tools you may have heard of. It does not generate images step by step the way diffusion models do. Instead, it takes a single pass to fill in just the mouth region of each frame, using audio information to decide what the lips should look like. It was trained on a combination of video datasets and uses several types of training signals to improve sharpness and synchronization accuracy.

Setting it up requires a machine with an NVIDIA GPU, Python 3.10, and a fairly involved installation process: you install PyTorch, a set of computer vision packages from a project called MMLab, and FFmpeg for video handling, then download the pretrained model weights separately. The README walks through each step in detail. A no-install demo is also available on Hugging Face Spaces if you want to try it before committing to the setup.

The training code was open-sourced in April 2025, so you can train your own version of the model if you have the data and compute budget. The README includes links to the technical paper for anyone who wants to understand the architecture in more depth.
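To make the real-time figure concrete: at 30 frames per second, each frame has a time budget of 1000 / 30 ≈ 33.3 ms. The sketch below shows why a single-pass model fits that budget more easily than a multi-step sampler. The 10 ms per forward pass and the 25-step sampler are hypothetical figures chosen for illustration, not measurements of MuseTalk or of any diffusion model.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at the given frame rate."""
    return 1000.0 / fps


def is_real_time(passes_per_frame: int, ms_per_pass: float, fps: float = 30.0) -> bool:
    """True if all forward passes for one frame fit inside the frame budget."""
    return passes_per_frame * ms_per_pass <= frame_budget_ms(fps)


# Hypothetical cost of one forward pass through the network.
MS_PER_PASS = 10.0

# Single-pass generation: one forward pass per frame.
single_pass_ok = is_real_time(passes_per_frame=1, ms_per_pass=MS_PER_PASS)

# Step-by-step generation: a hypothetical 25-step sampler needs 25 passes.
multi_step_ok = is_real_time(passes_per_frame=25, ms_per_pass=MS_PER_PASS)
```

Under these assumed numbers, the single-pass case uses 10 ms of the 33.3 ms budget, while the 25-step case would need 250 ms per frame and could not keep up with 30 fps.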

Copy-paste prompts

Prompt 1

Using MuseTalk, show me how to take a source video and a new audio file and generate a lip-synced output video from the command line.

Prompt 2

How do I set up MuseTalk on a machine with an NVIDIA GPU, including PyTorch, MMLab packages, FFmpeg, and downloading the pretrained weights?

Prompt 3

What are the hardware and software requirements to run MuseTalk at real-time 30fps lip sync locally?

Prompt 4

Using MuseTalk, how can I create a talking avatar that responds to text-to-speech audio for a virtual assistant demo?


Verify against the repo before relying on details.