explaingit

antgroup/echomimic_v2

4,568 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

An AI research tool that generates a video of a person speaking from just a still photo and an audio clip, animating the lips, head, and upper body in sync with the audio. Supports English and Chinese.

Mindmap

mindmap
  root((EchoMimicV2))
    Inputs
      Still photo
      Audio clip
      Pose reference video
    Outputs
      Animated video
      Lip sync
      Upper body motion
    Interfaces
      Python scripts
      Gradio web UI
      ComfyUI workflow
    Research
      CVPR 2025
      Ant Group

Things people build with this

USE CASE 1

Generate a talking-head video from a single photo and a speech audio file for a digital avatar or presentation.

USE CASE 2

Create animated spokespersons for video content without filming real people, using just a photo and script audio.

USE CASE 3

Test EchoMimicV2 through the Gradio web interface without writing any Python code.

USE CASE 4

Integrate EchoMimicV2 into a ComfyUI visual workflow for automated talking-head video production.

Tech stack

Python · PyTorch · Gradio · ComfyUI · Jupyter

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a high-end GPU such as an A100 and downloading multi-gigabyte model weights from Hugging Face before first use.
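
Because the weights run to several gigabytes, it can help to confirm free disk space before starting the download. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The `repo_id` argument, the target directory, and the 20 GiB threshold are placeholders for illustration, not values taken from the project, so take the real ones from the repository's README.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3  # bytes in one gibibyte


def has_free_space(path: str, required_gib: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


def download_weights(repo_id: str, local_dir: str, required_gib: float = 20.0) -> str:
    """Download a model snapshot from Hugging Face after a disk-space check.

    `required_gib` is an assumed safety margin, not a documented requirement.
    """
    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    if not has_free_space(str(target), required_gib):
        raise RuntimeError(
            f"Less than {required_gib} GiB free at {target}; free up space first."
        )
    # Imported lazily so the disk check works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=str(target))


# Hypothetical usage; substitute the model repo id given in the README:
# download_weights("<org>/<model-repo>", "./pretrained_weights")
```

Running the space check first avoids a partial multi-gigabyte download that fails midway on a full disk.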

In plain English

EchoMimicV2 is an AI system developed by researchers at Ant Group (the company behind Alipay) that generates video of a person talking from just a still photo and an audio clip. You provide a reference image and a speech recording, and the system produces a video in which the person in the image appears to speak, with lips, head, and upper body moving in sync with the audio. It covers more than the face: it animates the upper half of the body, including shoulder and hand movement, which the researchers call semi-body animation. The work was accepted at CVPR 2025, one of the top computer vision research conferences. The system supports both English and Chinese audio input.

Standard inference takes roughly 7 minutes to produce 120 frames of video. An accelerated version released in January 2025 cuts that to about 50 seconds on a high-end A100 GPU, roughly a 9x improvement.

A Gradio web interface lets users test it without writing Python code, and a ComfyUI integration is available for those who prefer that visual workflow tool. Internally, the process aligns the reference image with pose information extracted from a driving video, then generates the final animated output.

The repository provides model weights (hosted on Hugging Face and ModelScope), inference scripts, a Jupyter notebook demo, and the training dataset list along with processing scripts.

This is a research release aimed at people working in AI video generation, digital avatar creation, or related areas. It is not a simple consumer application: setup requires installing several Python dependencies and downloading multi-gigabyte model weights. The README links to installation tutorials and a community discussion thread covering common setup problems.
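
As a quick check on the inference timings quoted in this section, the sketch below works out the per-frame cost and the speedup from the rounded figures (about 7 minutes versus about 50 seconds for 120 frames). The function names are illustrative only. With these rounded inputs the ratio comes out a little under 9, which is consistent with the quoted 9x given that both timings are approximate.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of baseline time to accelerated time."""
    return baseline_seconds / accelerated_seconds


def seconds_per_frame(total_seconds: float, frames: int) -> float:
    """Average wall-clock time spent generating one frame."""
    return total_seconds / frames


FRAMES = 120
BASELINE_SECONDS = 7 * 60   # "roughly 7 minutes"
ACCELERATED_SECONDS = 50    # "about 50 seconds" on an A100

if __name__ == "__main__":
    ratio = speedup(BASELINE_SECONDS, ACCELERATED_SECONDS)
    print(f"speedup: {ratio:.1f}x")
    print(f"baseline: {seconds_per_frame(BASELINE_SECONDS, FRAMES):.2f} s/frame")
    print(f"accelerated: {seconds_per_frame(ACCELERATED_SECONDS, FRAMES):.2f} s/frame")
```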

Copy-paste prompts

Prompt 1
Using EchoMimicV2, write Python inference code to generate a talking video from a reference image and a WAV audio file.
Prompt 2
How do I launch the Gradio interface for EchoMimicV2 and test it with my own photo and audio clip?
Prompt 3
Help me add EchoMimicV2 as a node in a ComfyUI workflow to automate talking-head video generation.
Prompt 4
What GPU and VRAM does EchoMimicV2 need to run the accelerated inference mode that produces video in 50 seconds?

Verify against the repo before relying on details.