explaingit

moiseshorta/codicodec-flow

17 Python Audience · researcher Complexity · 5/5 Active Setup · hard

TLDR

A flow-matching transformer that continues a music clip by generating in the latent space of the CoDiCodec audio codec, runnable on Apple Silicon laptops.

Mindmap

mindmap
  root((CoDiCodec-Flow))
    Inputs
      Audio dataset
      Prompt clip
      Checkpoint file
    Outputs
      Continued audio
      Codec features
      Training samples
    Use Cases
      Continue a song
      Train music model
      Experiment with flow matching
    Tech Stack
      Python
      PyTorch
      CoDiCodec
      Apple Silicon

Things people build with this

USE CASE 1

Continue a short stereo audio prompt with new music by sampling from a trained CoDiCodec-Flow checkpoint.
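For a rough sense of how small a prompt becomes once encoded, the arithmetic can be sketched from the codec figures quoted in the plain-English summary on this page. This assumes those figures mean about 11.7 latent frames per second with 64 channels each; the function names are hypothetical and not part of the repository.

```python
import math

# Figures quoted in this summary; the frames-times-channels reading is an assumption.
SAMPLE_RATE_HZ = 48_000   # audio sample rate
AUDIO_CHANNELS = 2        # stereo
LATENT_RATE_HZ = 11.7     # approximate latent frames per second
LATENT_CHANNELS = 64


def latent_shape(seconds: float) -> tuple[int, int]:
    """Approximate (frames, channels) of the latent sequence for a clip."""
    return math.ceil(seconds * LATENT_RATE_HZ), LATENT_CHANNELS


def compression_ratio() -> float:
    """Raw audio values per second divided by latent values per second."""
    raw = SAMPLE_RATE_HZ * AUDIO_CHANNELS
    latent = LATENT_RATE_HZ * LATENT_CHANNELS
    return raw / latent


if __name__ == "__main__":
    frames, channels = latent_shape(10.0)
    print(f"10 s prompt -> about {frames} frames x {channels} channels")
    print(f"about {compression_ratio():.0f}x fewer values per second than raw audio")
```

Under these assumptions a 10 second prompt is only on the order of a hundred latent frames, which is why the transformer can work with it as a short sequence.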

USE CASE 2

Preprocess a folder of WAV, MP3, or FLAC tracks into compact codec features for training the model.
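The folder walk behind this step might look roughly like the sketch below. It is not the repository's code: `find_audio_files`, `feature_path`, and the `.pt` output suffix are hypothetical, and only the three extensions named above are assumed.

```python
from pathlib import Path

# Extensions named in this summary; the real script may accept more.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def find_audio_files(root: str) -> list[Path]:
    """Recursively collect audio files under root, sorted for reproducibility."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def feature_path(audio_path: Path, src_root: str, out_root: str) -> Path:
    """Mirror the source layout under out_root, using a hypothetical .pt suffix."""
    relative = audio_path.relative_to(src_root)
    return Path(out_root) / relative.with_suffix(".pt")
```

Each discovered file would then be encoded by the codec and saved at its mirrored path, so training never has to touch the original audio again.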

USE CASE 3

Train a 20M or 97M parameter transformer on an Apple Silicon laptop with at least 36 GB of unified memory.
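As a back-of-the-envelope check on the two model sizes, the sketch below estimates the memory held by weights, gradients, and optimizer state alone. It assumes fp32 values and an Adam-style optimizer with two moment buffers per parameter, neither of which is confirmed by this summary. Activations, batch data, and the codec itself are not counted and can be much larger.

```python
BYTES_PER_FP32 = 4


def training_state_gib(num_params: int) -> float:
    """GiB for weights + gradients + two Adam moment buffers, all in fp32."""
    copies = 4  # weights, gradients, first moment, second moment
    return num_params * BYTES_PER_FP32 * copies / 2**30


if __name__ == "__main__":
    for name, params in (("20M", 20_000_000), ("97M", 97_000_000)):
        print(f"{name}: about {training_state_gib(params):.2f} GiB of training state")
```

Both figures come out far below 36 GB, which suggests the memory recommendation is driven by factors other than raw parameter count.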

USE CASE 4

Compare audio quality between the Euler and Heun ODE solvers at different sampling step counts.
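The difference between the two solvers can be illustrated on a toy problem. The sketch below is not the project's sampler: it integrates a stand-in velocity field with a known exact answer, so the error of each method is measurable. In the real model the velocity would come from the trained transformer.

```python
import math


def euler_step(f, x, t, dt):
    """One explicit Euler step: follow the slope at the start of the interval."""
    return x + dt * f(x, t)


def heun_step(f, x, t, dt):
    """One Heun step: average the slope at the start and at an Euler-predicted end."""
    k1 = f(x, t)
    k2 = f(x + dt * k1, t + dt)
    return x + dt * 0.5 * (k1 + k2)


def integrate(step, f, x0, steps):
    """Integrate dx/dt = f(x, t) from t=0 to t=1 in equal steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = step(f, x, i * dt, dt)
    return x


def toy_velocity(x, t):
    """Toy stand-in for a learned velocity field; exact solution is x0 * e at t=1."""
    return x


if __name__ == "__main__":
    exact = math.e
    for steps in (10, 30, 100):
        e_err = abs(integrate(euler_step, toy_velocity, 1.0, steps) - exact)
        h_err = abs(integrate(heun_step, toy_velocity, 1.0, steps) - exact)
        print(f"{steps:>3} steps  euler err={e_err:.5f}  heun err={h_err:.5f}")
```

Heun evaluates the velocity twice per step, so at equal step counts it costs about twice as much as Euler. A fair listening comparison would therefore also match the number of model evaluations, not only the number of steps.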

Tech stack

Python, PyTorch, CoDiCodec

Getting it running

Difficulty · hard Time to first run · 1 day+

Recommended setup needs an Apple Silicon machine with around 36 GB of unified memory, and full training takes millions of steps.

In plain English

CoDiCodec-Flow is a research project by Moises Horta Valenzuela that trains a model to keep playing music after you give it a short audio clip. Instead of generating raw audio samples one at a time, which is slow, the model works inside a compact representation produced by a separate audio codec, CoDiCodec. That codec compresses 48 kHz stereo audio into a much smaller stream of roughly 11.7 latent frames per second with 64 channels each. Generating in that small space and then decoding back to audio is much faster than generating the waveform directly.

The generator is a transformer arranged so that each chunk of output only looks at past chunks, which is what makes streaming continuation possible. It is trained with Flow Matching, a technique that teaches the model to push random noise toward the codec's latent values. The README says the codec's latents are close to a unit Gaussian distribution after a fixed transform, which makes them a clean target for this kind of training. The whole pipeline is written to work on Apple Silicon, so an Apple laptop with 36 GB of unified memory can train and run the model without an external GPU.

A command-line script provides three main modes. The preprocess mode walks through a folder of audio files in formats such as WAV, MP3, or FLAC and writes each one out as a small file of codec features. The train mode reads those files and trains the model, writing checkpoints every 50 steps and producing periodic sample audio. The sample mode loads a checkpoint and either continues a prompt audio file or generates audio without a prompt.

The README lists several tunables. A smaller model of around 20 million parameters trains faster, while the default size of about 97 million parameters is recommended for machines with 36 GB of memory or more. At generation time you choose how many sampling steps to run: fewer steps produce audio faster, and more steps produce higher quality. Two ODE solvers are available, Euler and Heun, with Heun aimed at quality. The author notes that the project's main checkpoint was trained for 6.86 million steps.
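To make the Flow Matching idea concrete, here is a minimal sketch of how one training example can be built. It uses the common straight-line path from noise to data, which this summary does not confirm as the repository's exact formulation, and the four-value "latent frame" is made up for illustration.

```python
import random


def flow_matching_pair(data, rng):
    """Build one training example on the linear path from noise to data.

    Returns (x_t, t, target_velocity), where x_t = (1 - t) * noise + t * data
    and the regression target is data - noise.
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target = [d - n for n, d in zip(noise, data)]
    return x_t, t, target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    latent_frame = [0.3, -1.2, 0.8, 0.1]   # made-up stand-in for one codec latent frame
    x_t, t, target = flow_matching_pair(latent_frame, rng)
    zero_prediction = [0.0] * len(target)  # a real model would predict from (x_t, t)
    print(f"t={t:.3f}  loss of a zero predictor={mse(zero_prediction, target):.3f}")
```

In this formulation the model receives `x_t` and `t`, predicts a velocity, and is trained to minimise the squared error against `target`. At sampling time an ODE solver follows the predicted velocity from pure noise at t=0 to a latent at t=1. Because the starting noise is a unit Gaussian, latents that are themselves close to a unit Gaussian keep the two ends of the path on a similar scale.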

Copy-paste prompts

Prompt 1
Set up CoDiCodec-Flow on a 36 GB M3 Max laptop. List the exact preprocess, train, and sample commands to continue a 10 second prompt clip.
Prompt 2
I want to train a smaller 20M parameter variant of CoDiCodec-Flow on 50 hours of techno. Suggest batch size, learning rate, and how often to checkpoint.
Prompt 3
Write a script that runs CoDiCodec-Flow sampling for both Euler and Heun solvers at 10, 30, and 100 steps and saves the outputs for A/B listening.
Prompt 4
Explain how CoDiCodec-Flow makes the codec latents close to a unit Gaussian and why Flow Matching needs that property. Reference the README claims.
Prompt 5
Adapt CoDiCodec-Flow to accept a text prompt alongside the audio prompt by injecting CLAP embeddings. Sketch the model and training changes.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.