Generate realistic spoken audio from text using a plain-English prompt to select and run the right synthesizer automatically.
Transcribe or enhance audio recordings by describing what you want in natural language rather than calling individual model APIs.
Create sound effects from text descriptions or generate audio that matches the mood of an input image.
Produce a talking-head video of a face speaking from an audio clip using a single text instruction.
Requires downloading and installing multiple large pre-trained models, and many components are still marked as works in progress.
AudioGPT is a research project that connects a large language model (a ChatGPT-style AI) with a collection of specialized audio and speech models. The idea is that you can describe what you want in plain text, and the system figures out which underlying model to use and runs it for you, whether that means generating speech from text, transcribing spoken audio, separating voices in a recording, or synthesizing a talking face video from an audio clip.

The project covers four broad categories of audio tasks. In the speech category, it can convert text to spoken audio using several different synthesizer models, recognize and transcribe speech, enhance audio quality, separate overlapping speakers, and convert mono audio to spatial (binaural) audio. In the singing category, it can generate sung vocal performances from text lyrics. In the general audio category, it can create sound effects from text descriptions, fill in missing portions of audio, generate sounds based on an input image, and detect or extract specific sounds from a recording. It also includes a talking head task, which means generating a video of a face speaking from an audio input.

This is a research implementation, not a polished commercial product. Many of the individual models listed in the README are marked as works in progress, meaning they are included but not fully functional yet. Getting it running requires following a separate setup guide and installing multiple pre-trained models.

The project is accompanied by a published research paper on arXiv and a live demo on Hugging Face for those who want to try the concept without setting up the code locally. The README for this repository is brief and points mostly to other documentation rather than explaining everything in one place.
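
The routing idea described above, where a plain-text request is matched to one of several specialized models, can be illustrated with a small sketch. This is not AudioGPT's code: the project uses a large language model to pick the model, while this sketch substitutes simple keyword matching so that it runs with no dependencies. The names Tool, TOOLS, route, and handle, the keyword lists, and the stub handlers are all hypothetical and stand in for the real synthesizers and recognizers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Tool:
    """A hypothetical wrapper around one specialized audio model."""
    name: str
    keywords: List[str]
    run: Callable[[str], str]


def make_stub(task: str) -> Callable[[str], str]:
    """Return a placeholder handler; a real system would call a model here."""
    def run(request: str) -> str:
        return f"[{task}] would handle: {request}"
    return run


# Illustrative registry only. The task names follow the categories in the
# description; the keywords are assumptions made for this sketch.
TOOLS = [
    Tool("text-to-speech", ["speak", "read aloud", "say"], make_stub("text-to-speech")),
    Tool("speech-recognition", ["transcribe", "transcript"], make_stub("speech-recognition")),
    Tool("sound-effect", ["sound effect", "sound of"], make_stub("sound-effect")),
    Tool("talking-head", ["talking head", "talking face"], make_stub("talking-head")),
]


def route(request: str) -> Optional[Tool]:
    """Pick the tool whose keywords match the request most often.

    Keyword counting is a stand-in for the language model's decision.
    Returns None when no keyword matches.
    """
    text = request.lower()
    best_tool = None
    best_score = 0
    for tool in TOOLS:
        score = sum(1 for keyword in tool.keywords if keyword in text)
        if score > best_score:
            best_tool, best_score = tool, score
    return best_tool


def handle(request: str) -> str:
    """Route a request and run the selected tool, or report no match."""
    tool = route(request)
    if tool is None:
        return "No matching audio tool for this request."
    return tool.run(request)


if __name__ == "__main__":
    print(handle("Please transcribe this recording"))
    print(handle("Generate the sound of rain on a roof"))
```

Keeping each model behind a uniform run interface is what lets a single text instruction reach any of them. Swapping the keyword scorer for a language model call would change only route.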
Verify against the repo before relying on details.