Analysis updated 2026-05-18
Generate sound effects for a game or video by typing a text description and getting a .wav file back.
Batch-produce ambient audio files from a list of text prompts using the built-in batch mode.
Run a large text-to-audio model entirely on your own machine without any cloud API costs.
Combine speech, music, and environment tags in one call to produce layered audio scenes.
| | audiohacking/audiogen.cpp | adeliox/klein-head-swap | ats4321/ragit |
|---|---|---|---|
| Stars | 4 | 4 | 4 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 4/5 | 3/5 | 2/5 |
| Audience | developer | designer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires building from source with make and downloading model files of roughly 2.8 to 5.7 GB, depending on the quantization level.
audiogen.cpp is a C++ program that lets you generate audio clips from text descriptions on your own computer. You type something like "a dog barking loudly" or "rain falling on a window," and it produces a .wav audio file matching that description. The AI model behind this is called Dasheng-AudioGen, a large two-billion-parameter model designed to produce speech, music, sound effects, and ambient sounds all from the same system.
The project is built in C++17, which makes it run much faster than the original Python version. According to the benchmarks in the README, the C++ build is roughly 3.7 times faster than Python on Apple hardware. For a 10-second audio clip the C++ version finishes in about 6 seconds rather than 22 seconds. There are build options for Apple Metal on macOS, plain CPU processing, and CUDA for NVIDIA GPUs on Linux.
Getting started requires cloning the repo, building it with a single make command, and downloading model files ranging from about 2.8 GB to 5.7 GB depending on the quality level you want. The models come in three sizes: full precision for best quality, Q8 at 33% smaller, and Q4 at 51% smaller for the smallest footprint. Once built, you run a command-line program pointing at those model files and supply a text description.
The command-line tool accepts several input tags beyond a plain caption. You can layer speech, music, environmental sounds, and sound effects in a single call by using flags such as --speech, --music, --env, and --sfx alongside --caption. There is also a batch mode where you supply a text file of prompts and the program processes them all at once, writing individual .wav files to an output directory.
The project is marked experimental. It is released under the Apache 2.0 license, which allows free use including commercial projects. The model weights are downloaded from Hugging Face in a format called GGUF, which is what the GGML inference engine reads.
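As a rough illustration of the layered call described above, the following Python sketch assembles a command line from the flags named in the description (--caption, --speech, --music, --env, --sfx). The binary path `./audiogen` is a hypothetical placeholder, the assumption that each flag takes one quoted string is not confirmed here, and the model-file arguments are left out because their flag names are not given. The sketch only builds and prints the command; it does not run anything.

```python
from typing import List, Optional

# Hypothetical binary path: the real executable name is not stated in this
# analysis, so check the repo's README before using it.
BINARY = "./audiogen"


def build_command(caption: str,
                  speech: Optional[str] = None,
                  music: Optional[str] = None,
                  env: Optional[str] = None,
                  sfx: Optional[str] = None) -> List[str]:
    """Assemble an argv list from the tag flags named in the description.

    Assumption: every flag takes a single string value. Model-file
    arguments are omitted because their flag names are not documented here.
    """
    args = [BINARY, "--caption", caption]
    optional_tags = (
        ("--speech", speech),
        ("--music", music),
        ("--env", env),
        ("--sfx", sfx),
    )
    for flag, value in optional_tags:
        if value:  # skip tags that were not supplied
            args += [flag, value]
    return args


# Example: a caption layered with a music tag and an environment tag.
cmd = build_command(
    "a busy cafe",
    music="soft jazz piano",
    env="rain on a window",
)
print(cmd)
```

Keeping the arguments as a list rather than one joined string avoids shell-quoting problems if the list is later passed to `subprocess.run`.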
A C++ command-line tool that generates audio clips from text descriptions, running a 2-billion-parameter AI model locally at about 3.7x the speed of Python.
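The speed and size figures quoted in the description can be cross-checked with a few lines of arithmetic. This sketch assumes the 5.7 GB figure is the full-precision model and that the Q8 and Q4 percentages are reductions relative to it; neither assumption is stated outright in the text, though the results line up with the quoted 2.8 GB lower bound.

```python
# Timings quoted for a 10-second clip (README benchmarks, Apple hardware).
python_seconds = 22.0
cpp_seconds = 6.0
speedup = python_seconds / cpp_seconds
print(f"speedup: {speedup:.1f}x")

# Assumption: 5.7 GB is the full-precision size, and the quantized
# variants are 33% (Q8) and 51% (Q4) smaller than that baseline.
full_gb = 5.7
q8_gb = full_gb * (1 - 0.33)
q4_gb = full_gb * (1 - 0.51)
print(f"Q8: {q8_gb:.1f} GB")
print(f"Q4: {q4_gb:.1f} GB")
```

Under these assumptions the ratio rounds to 3.7x and the Q4 size to about 2.8 GB, which matches the figures given in the description.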
Mainly Python. The stack also includes C++17 and GGML.
Use freely for any purpose including commercial projects, as long as you keep the copyright and license notice.
Setup difficulty is rated hard; expect an hour or more before a first successful run.
Aimed mainly at developers.
Verify against the repo before relying on details.