antgroup/echomimic_v2

★ 4,568 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((EchoMimicV2))
    Inputs
      Still photo
      Audio clip
      Pose reference video
    Outputs
      Animated video
      Lip sync
      Upper body motion
    Interfaces
      Python scripts
      Gradio web UI
      ComfyUI workflow
    Research
      CVPR 2025
      Ant Group

Things people build with this

USE CASE 1

Generate a talking-head video from a single photo and a speech audio file for a digital avatar or presentation.

USE CASE 2

Create animated spokespersons for video content without filming real people, using just a photo and script audio.

USE CASE 3

Test EchoMimicV2 through the Gradio web interface without writing any Python code.

USE CASE 4

Integrate EchoMimicV2 into a ComfyUI visual workflow for automated talking-head video production.

Tech stack

Python, PyTorch, Gradio, ComfyUI, Jupyter

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a high-end GPU (the quoted timings are for an A100) and a multi-gigabyte download of model weights from Hugging Face or ModelScope before first use.

In plain English

EchoMimicV2 is an AI system developed by researchers at Ant Group (the company behind Alipay) that generates a video of a person talking from a still photo and an audio clip. You provide a reference image and a speech recording, and the system produces a video in which the person in the image appears to speak, with lips, head, and upper body moving in sync with the audio. It covers more than the face: it animates the upper half of the body, including shoulder and hand movement, which the researchers call semi-body animation. Internally, the process aligns the reference image with pose information extracted from a driving video, then generates the final animated output.

The work was accepted at CVPR 2025, one of the top computer vision research conferences. The system supports both English and Chinese audio input.

Standard inference takes roughly 7 minutes to produce 120 frames of video. An accelerated version released in January 2025 cuts that to about 50 seconds on a high-end A100 GPU, an improvement of about 9x.

A Gradio web interface lets users test the system without writing Python code, and a ComfyUI integration is available for those who prefer that visual workflow tool. The repository includes model weights hosted on Hugging Face and ModelScope, inference scripts, a Jupyter notebook demo, and the training dataset list along with processing scripts.

This is a research release aimed at people working in AI video generation, digital avatar creation, or related areas. It is not a simple consumer application: setup requires installing several Python dependencies and downloading multi-gigabyte model weights. The README links to installation tutorials and a community discussion thread covering common setup problems.
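To put the timing figures quoted above in perspective, here is a small back-of-the-envelope calculation in Python. It uses only the rounded numbers from the text (120 frames, about 7 minutes, about 50 seconds). The helper names and the assumption that generation time scales linearly with frame count are illustrative and do not come from the repository.

```python
# Figures quoted in the text; both timings are rounded approximations.
FRAMES = 120
STANDARD_SECONDS = 7 * 60   # "roughly 7 minutes"
ACCELERATED_SECONDS = 50    # "about 50 seconds" on an A100


def throughput(frames: int, seconds: float) -> float:
    """Frames generated per second of wall-clock time."""
    return frames / seconds


def estimate_seconds(frames: int, ref_frames: int, ref_seconds: float) -> float:
    """Estimate generation time for a clip of a different length.

    Assumption (not from the source): time scales linearly with frame count.
    """
    return frames * ref_seconds / ref_frames


standard_fps = throughput(FRAMES, STANDARD_SECONDS)
accelerated_fps = throughput(FRAMES, ACCELERATED_SECONDS)
ratio = STANDARD_SECONDS / ACCELERATED_SECONDS

print(f"standard:    {standard_fps:.2f} frames/s")
print(f"accelerated: {accelerated_fps:.2f} frames/s")
print(f"speedup:     {ratio:.1f}x")
```

With these rounded inputs the ratio comes out at 8.4x, slightly below the quoted 9x. The gap is plausibly due to rounding in the quoted times, so treat both numbers as approximate.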

Copy-paste prompts

Prompt 1

Using EchoMimicV2, write Python inference code to generate a talking video from a reference image and a WAV audio file.

Prompt 2

How do I launch the Gradio interface for EchoMimicV2 and test it with my own photo and audio clip?

Prompt 3

Help me add EchoMimicV2 as a node in a ComfyUI workflow to automate talking-head video generation.

Prompt 4

What GPU and VRAM does EchoMimicV2 need to run the accelerated inference mode that produces video in 50 seconds?

Verify against the repo before relying on details.