Continue a short stereo audio prompt with new music by sampling from a trained CoDiCodec-Flow checkpoint.
Preprocess a folder of WAV, MP3, or FLAC tracks into compact codec features for training the model.
Train a 20M or 97M parameter transformer on an Apple Silicon laptop with at least 36 GB of unified memory.
Compare audio quality between the Euler and Heun ODE solvers at different sampling step counts.
The recommended setup is an Apple Silicon machine with around 36 GB of unified memory, and full training takes millions of steps.
CoDiCodec-Flow is a research project by Moises Horta Valenzuela that trains a model to keep playing music after you give it a short audio clip. Instead of generating raw audio samples one at a time, which is slow, the model works inside a compact representation produced by a separate codec called CoDiCodec. That codec compresses 48 kHz stereo audio into a much smaller stream of roughly 11.7 latent frames per second with 64 channels each. Generating in that small space and then decoding back to audio is much faster than generating sound directly.
The generator is a transformer arranged so that each chunk of output only looks at past chunks, which is what makes streaming continuation possible. It is trained with Flow Matching, a technique that teaches the model to push random noise toward the codec's latent values. The README says the codec's latents are close to a unit Gaussian distribution after a fixed transform, which makes them a clean target for this kind of training. The whole pipeline is written to work on Apple Silicon, so a 36 GB Apple laptop can train and run the model without needing an external GPU.
There is a command-line script with three main modes. The preprocess mode walks through a folder of audio files in formats like WAV, MP3, or FLAC and writes each one out as a small file of codec features. The train mode reads those files and trains the model, writing checkpoints every 50 steps and producing periodic sample audio. The sample mode loads a checkpoint and either continues a prompt audio file or generates audio without a prompt.
The README lists several tunables. A smaller model of around 20 million parameters trains faster, while the default size of about 97 million parameters is recommended for machines with 36 GB of memory or more. At generation time you pick how many sampling steps to run: fewer steps produce audio faster, and more steps produce higher quality. Two ODE solvers are available, Euler and Heun, with Heun aimed at quality.
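To make the Flow Matching and solver discussion concrete, here is a minimal toy sketch in plain Python. It is not the project's code. It assumes a straight-line noise-to-data path (a common Flow Matching choice that the summary does not confirm for this repo), and it replaces the trained transformer with a closed-form velocity field for a 1-D Gaussian "dataset". All names (`flow_matching_pair`, `velocity`, `sample_euler`, `sample_heun`, `MEAN`, `STD`) are hypothetical.

```python
# Toy stand-in for codec latents: "data" is a 1-D Gaussian N(MEAN, STD^2),
# and noise is N(0, 1). These values and names are illustrative assumptions.
MEAN, STD = 2.0, 0.5


def flow_matching_pair(noise, data, t):
    """Build one training pair on a straight noise-to-data path (assumed)."""
    x_t = (1.0 - t) * noise + t * data   # point on the path at time t
    target_velocity = data - noise       # velocity is constant along that path
    return x_t, target_velocity


def velocity(x, t):
    """Closed-form marginal velocity for this Gaussian toy problem.

    In a real system a trained network would predict this value instead.
    """
    var_t = (1.0 - t) ** 2 + (t * STD) ** 2
    slope = (t * STD ** 2 - (1.0 - t)) / var_t
    return MEAN + slope * (x - t * MEAN)


def sample_euler(x, steps):
    """First-order solver: one velocity evaluation per step."""
    h = 1.0 / steps
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x


def sample_heun(x, steps):
    """Second-order solver: average the slopes at both ends of each step."""
    h = 1.0 / steps
    for k in range(steps):
        t = k * h
        v_start = velocity(x, t)
        x_pred = x + h * v_start          # provisional Euler step
        v_end = velocity(x_pred, t + h)   # slope at the predicted endpoint
        x = x + 0.5 * h * (v_start + v_end)
    return x


def exact_endpoint(noise):
    """Exact ODE solution at t = 1 for this toy field."""
    return MEAN + STD * noise


if __name__ == "__main__":
    noise = 1.0
    for steps in (4, 8, 16):
        euler_err = abs(sample_euler(noise, steps) - exact_endpoint(noise))
        heun_err = abs(sample_heun(noise, steps) - exact_endpoint(noise))
        print(f"steps={steps:2d}  euler_err={euler_err:.4f}  heun_err={heun_err:.4f}")
```

Heun evaluates the velocity twice per step, so a fair cost comparison pits Heun at N steps against Euler at 2N steps. Whether the same trade-off holds for the real model is something to check by listening to samples from an actual checkpoint.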
The author notes that the project's main checkpoint was trained for 6.86 million steps.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.