explaingit

avaturn-live/avtr-1

Analysis updated 2026-05-18

Stars · 362 | Language · Python | Audience · developer | Complexity · 4/5 | Setup · hard

TLDR

AVTR-1 turns a portrait photo and audio into a real-time lip-synced talking avatar video, with support for two-speaker dialogues and live deployment on a single NVIDIA GPU.

Mindmap

mindmap
  root((AVTR-1))
    What it does
      Lip sync from audio
      Active listening motion
      Two speaker dialogue
    Tech stack
      Python
      TensorRT
      CUDA
      HuggingFace
    Use cases
      Live avatar sessions
      Offline video generation
      Dialogue video stitching
    Deployment
      Self hosted GPU
      Managed cloud API
    Setup
      NVIDIA GPU required
      TRT engine build once
      HuggingFace weights

What do people build with it?

USE CASE 1

Generate a lip-synced talking-head video of any portrait photo speaking a given audio track.

USE CASE 2

Create a two-speaker dialogue video where each avatar reacts and listens as the other speaks.

USE CASE 3

Run a live interactive avatar streaming session for customer support or virtual presence apps.

USE CASE 4

Deploy a self-hosted avatar API on your own GPU server instead of using a cloud service.

What is it built with?

Python, TensorRT, CUDA, HuggingFace, ONNX, pixi

How does it compare?

| | avaturn-live/avtr-1 | evilsocket/audit | evolink-ai/awesome-blender-seedance-workflow-usecases |
|---|---|---|---|
| Stars | 362 | 397 | 295 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 4/5 | 4/5 | 3/5 |
| Audience | developer | developer | designer |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Requires a Linux machine with an NVIDIA Ampere GPU or newer, CUDA 12.x, and TensorRT 10.x, plus a one-time TRT engine compilation step.
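Before starting the install, it can help to confirm that the GPU meets the Ampere-or-newer requirement. The sketch below is not part of AVTR-1. It assumes `nvidia-smi` is on the PATH and that the installed driver is recent enough to support the `compute_cap` query field. It treats compute capability 8.0 and above as "Ampere or newer"; the function names are hypothetical.

```python
from __future__ import annotations

import shutil
import subprocess

# Ampere GPUs report compute capability 8.x; newer generations report higher.
AMPERE_MIN_COMPUTE_CAP = (8, 0)


def parse_compute_cap(text: str) -> tuple[int, int]:
    """Turn a string such as '8.6' into a comparable (major, minor) tuple."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def is_ampere_or_newer(compute_cap: str) -> bool:
    """Return True if the compute capability is 8.0 or higher."""
    return parse_compute_cap(compute_cap) >= AMPERE_MIN_COMPUTE_CAP


def query_gpus() -> list[tuple[str, str]]:
    """Return (name, compute_cap) for each GPU, or [] if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # Older drivers may not know the compute_cap field.
        return []
    gpus = []
    for line in result.stdout.splitlines():
        if "," in line:
            name, cap = line.rsplit(",", 1)
            gpus.append((name.strip(), cap.strip()))
    return gpus


if __name__ == "__main__":
    found = query_gpus()
    if not found:
        print("No NVIDIA GPU detected (or nvidia-smi could not be queried).")
    for name, cap in found:
        verdict = "OK" if is_ampere_or_newer(cap) else "older than Ampere"
        print(f"{name}: compute capability {cap} ({verdict})")
```

This only checks the GPU generation. CUDA 12.x and TensorRT 10.x still need to be verified separately, and the repo's own list of supported cards takes precedence over this heuristic.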

In plain English

AVTR-1 is a project that turns a single portrait photo into a talking, reacting avatar that can hold live conversations. You give it a picture of a person and an audio clip, and it produces a video in which that person appears to speak and listen in real time, with lip movements that match the audio.

The system can also handle two-speaker dialogues. Feed it audio from both sides of a conversation and it will generate video of each speaker reacting appropriately, with the avatar looking engaged while the other person is talking. It is built specifically for live use and runs fast enough to keep up with a real-time audio stream on modern graphics hardware.

Setup requires a Linux computer with an NVIDIA graphics card from the Ampere generation or newer (RTX 3070 and above are listed as supported). You also need CUDA and TensorRT, which are NVIDIA software frameworks for running AI models on graphics cards. The installation process downloads pre-trained model weights from HuggingFace, a public AI model hosting site, then compiles them into fast inference engines specific to your hardware. This compilation step happens once and can take a while.

Once installed, you can run the demo interactively or generate video files offline. The offline mode supports single-speaker lip-sync, two-speaker dialogue with both sides rendered, or idle motion without any audio. All output is standard MP4 video, and you can stitch both sides of a dialogue into a single side-by-side video using ffmpeg, a standard video tool.

The project also offers a managed cloud API at avaturn.live if you want to skip the GPU setup entirely. Model weights and inference code are publicly available; a technical report and a production-ready backend are listed as coming soon.
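The side-by-side stitching step can be sketched with ffmpeg's `hstack` filter, driven from Python. This is an illustration, not a command taken from the AVTR-1 repo: the file names are hypothetical, and it assumes both MP4 files have the same height and frame rate and that `ffmpeg` is installed. The sketch maps only the stacked video stream, so audio is dropped and would need to be added back separately.

```python
from __future__ import annotations

import subprocess


def build_side_by_side_command(left: str, right: str, output: str) -> list[str]:
    """Build an ffmpeg command that places two videos next to each other."""
    return [
        "ffmpeg",
        "-y",  # overwrite the output file if it already exists
        "-i", left,
        "-i", right,
        # hstack joins the two video streams horizontally into one stream [v].
        "-filter_complex", "[0:v][1:v]hstack=inputs=2[v]",
        "-map", "[v]",
        output,
    ]


def stitch_side_by_side(left: str, right: str, output: str) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_side_by_side_command(left, right, output), check=True)


if __name__ == "__main__":
    # Hypothetical file names for the two rendered sides of a dialogue.
    stitch_side_by_side("speaker_a.mp4", "speaker_b.mp4", "dialogue.mp4")
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to print and inspect the command before running it on real output.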

Copy-paste prompts

Prompt 1
Using AVTR-1, show me how to generate a lip-synced video from a portrait image and a WAV audio file, step by step.
Prompt 2
Write a shell script that uses AVTR-1 to produce a two-speaker side-by-side dialogue video from two separate audio tracks.
Prompt 3
How do I set up AVTR-1 on an NVIDIA RTX 4060 Ti with CUDA 12 and TensorRT 10, including downloading weights and building TRT engines?
Prompt 4
Show me how to run the AVTR-1 interactive demo locally and connect it to a browser-based front end.
Prompt 5
What are the minimum GPU requirements for AVTR-1 to run in real time at 25 fps, and which GPUs just miss the cutoff?

Frequently asked questions

What is avtr-1?

AVTR-1 turns a portrait photo and audio into a real-time lip-synced talking avatar video, with support for two-speaker dialogues and live deployment on a single NVIDIA GPU.

What language is avtr-1 written in?

Mainly Python. The stack also includes TensorRT and CUDA.

How hard is avtr-1 to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is avtr-1 for?

Mainly developers.

Verify against the repo before relying on details.