Run inference on a stereo song and get back a vocals-removed instrumental of the same length.
Train a custom vocal-removal model on your own dataset using the documented STFT and band layout.
Study a U-Net with bidirectional Mamba blocks, a self-attention bottleneck, and a cross-attention decoder as a reference architecture for source separation.
Reuse the magnitude + band-wise log magnitude + SI-SDR loss recipe in other audio masking tasks.
Install, dataset, and training steps live in a separate SETUP.md; serious training expects a GPU with the Mamba CUDA kernels installed.
This repository holds the open-source model behind instr.io, a project that strips the vocals out of a song and leaves just the instrumental track. You give it a normal stereo audio file and it returns a stereo instrumental of the same length. The install steps, dataset layout, training commands, and inference instructions live in a separate SETUP.md file.

Under the hood, the audio is first converted to a frequency representation with a 2048-point Short-Time Fourier Transform. That produces 1025 frequency bins per time slice (2048 / 2 + 1), which the code groups into 72 bands of uneven width. The bands cover the vocal range, roughly 80 Hz to 4 kHz, in more detail than the rest of the spectrum, since vocals are the main thing the model is trying to isolate.

The model itself is shaped like a U-Net, a common pattern that shrinks the signal down through several stages and then expands it back up. Each stage uses bidirectional Mamba blocks, a newer kind of sequence layer. At the deepest point the model alternates Mamba blocks with global self-attention, which lets every part of the audio attend to every other part.

On the way back up, the decoder does not look at the entire encoder output directly. Instead, each encoder level is pooled into a small set of summary tokens, around 160 per band in total, and the decoder reads from that compressed memory using cross-attention. Gated skip connections let the decoder choose how much fine detail from the encoder to pull back in.

The final layer predicts a magnitude and a phase value per frequency bin for each stereo channel. Those values form a mask that is multiplied against the original frequency representation, and the result is converted back into a regular audio waveform.

Training combines a magnitude loss, a band-wise log-magnitude loss, and an SI-SDR (scale-invariant signal-to-distortion ratio) loss, which scores how closely the separated waveform matches the target regardless of overall volume.
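
To make the SI-SDR term in the training loss more concrete, here is a minimal pure-Python sketch of the standard SI-SDR definition. It is not taken from this repository: the function name `si_sdr`, the `eps` stabilizer, and the test signals are all assumptions made for illustration, and the repo may differ in details such as mean removal, batching, or how the score is turned into a loss (commonly by negating it).

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length mono signals.

    Hypothetical helper for illustration; not the repository's loss code.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Project the estimate onto the target: the scaled target that best
    # explains the estimate. This is what makes the score scale-invariant.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]

    # Whatever the projection does not explain counts as distortion.
    residual = [e - p for e, p in zip(estimate, projection)]

    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


# Assumed test signals: a 440 Hz tone at 44.1 kHz and two degraded copies.
sample_rate = 44100
target = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(4410)]
quiet = [0.5 * t for t in target]  # same signal at half the volume
noisy = [
    t + 0.1 * math.cos(2 * math.pi * 3000 * n / sample_rate)
    for n, t in enumerate(target)
]  # same signal plus an interfering tone at one tenth the amplitude

print(f"quiet copy: {si_sdr(quiet, target):.1f} dB")
print(f"noisy copy: {si_sdr(noisy, target):.1f} dB")
```

The quiet copy scores very high because a pure volume change is not penalized, while the noisy copy lands near 20 dB, matching the 10:1 amplitude ratio between the tone and the interference. That insensitivity to gain is why SI-SDR is usually paired with magnitude-based terms, as in the recipe above.
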
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.