explaingit

instr-io/ml

0 · Python · Audience · researcher · Complexity · 5/5 · Active · Setup · hard

TLDR

Open-source vocal-removal model behind instr.io. A U-Net with bidirectional Mamba blocks and a cross-attention bottleneck predicts a mask that is applied to the song's STFT to produce a clean stereo instrumental.

Mindmap

mindmap
  root((instr-io-ml))
    Inputs
      Stereo audio file
      STFT bands
    Outputs
      Instrumental track
      Magnitude phase mask
    Use Cases
      Karaoke generation
      Music remixing
      Source separation research
    Tech Stack
      Python
      PyTorch
      Mamba
      STFT

Things people build with this

USE CASE 1

Run inference on a stereo song and get back a vocals-removed instrumental of the same length.

USE CASE 2

Train a custom vocal-removal model on your own dataset using the documented STFT and band layout.

USE CASE 3

Study a U-Net with bidirectional Mamba blocks and a cross-attention bottleneck as a reference architecture for source separation.

USE CASE 4

Reuse the magnitude + band-wise log magnitude + SI-SDR loss recipe in other audio masking tasks.
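
For readers who want to reuse the loss recipe, the SI-SDR term can be sketched as follows. This is a minimal NumPy illustration of the standard scale-invariant SDR definition, not the repository's code; the function names `si_sdr` and `si_sdr_loss` and the `eps` value are assumptions made for this example, and the weights used to mix it with the magnitude losses are not stated here.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals (higher is better)."""
    # Remove DC offset so the score ignores constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the best-matching scale.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection counts as distortion.
    noise = estimate - projection
    ratio = np.dot(projection, projection) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio + eps)

def si_sdr_loss(estimate, target):
    """Negated SI-SDR, so that minimizing the loss maximizes the score."""
    return -si_sdr(estimate, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why this term is usually paired with magnitude losses that do penalize level errors.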

Tech stack

Python · PyTorch · Mamba · STFT

Getting it running

Difficulty · hard · Time to first run · 1 day+

Install, dataset, and training steps live in a separate SETUP.md; serious training expects a GPU with the Mamba CUDA kernels installed.

License is not stated in the available content.

In plain English

This repository holds the open-source model behind instr.io, a project that strips the vocals out of a song and leaves just the instrumental track. You give it a normal stereo audio file and it returns a stereo instrumental of the same length. The install steps, dataset layout, training commands, and inference instructions live in a separate SETUP.md file.

Under the hood, the audio is first converted into a frequency representation using a 2048-point Short-Time Fourier Transform. That produces about a thousand frequency bins per time slice (1025 for a one-sided transform), which the code then groups into 72 bands of uneven width. The bands cover the vocal range, roughly 80 Hz to 4 kHz, in more detail than the rest of the spectrum, since vocals are the main thing the model is trying to isolate.

The model itself is shaped like a U-Net, a common pattern that shrinks the signal down through several stages and then expands it back up. Each stage uses bidirectional Mamba blocks, a newer kind of sequence layer. At the deepest point the model alternates Mamba blocks with global self-attention, which lets every part of the audio attend to every other part.

On the way back up, the decoder does not look at the entire encoder output directly. Instead, each encoder level is pooled into a small set of summary tokens, around 160 per band in total, and the decoder reads from that compressed memory using cross-attention. Gated skip connections let the decoder choose how much fine detail from the encoder it pulls back in.

The final layer predicts a magnitude and a phase value per frequency bin for each stereo channel. Those numbers become a mask that is multiplied against the original frequency representation, and the result is converted back into a regular audio waveform. Training combines a magnitude loss, a band-wise log-magnitude loss, and an SI-SDR loss, which scores how closely the separated waveform matches the target.
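
The transform, mask, and inverse-transform steps described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the repository's implementation: the 2048-point FFT size comes from the description, while the hop size of 512, the Hann window, and the helper names `stft`, `istft`, and `apply_mask` are assumptions made for this example. It handles one channel; a stereo file would run each channel through the same steps.

```python
import numpy as np

N_FFT = 2048  # FFT size from the description
HOP = 512     # assumption: the hop size is not stated in the source

def stft(x, n_fft=N_FFT, hop=HOP):
    """Return a (frames, n_fft // 2 + 1) complex spectrogram of a mono signal."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)

def istft(spec, n_fft=N_FFT, hop=HOP):
    """Invert stft() with windowed overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    length = n_fft + hop * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window ** 2
    # Divide by the summed squared window; the clamp avoids dividing by
    # zero at the very edges, where the window tapers to nothing.
    return out / np.maximum(norm, 1e-8)

def apply_mask(spec, mask_mag, mask_phase):
    """Scale each bin by mask_mag and rotate it by mask_phase (radians)."""
    return spec * (mask_mag * np.exp(1j * mask_phase))
```

In this sketch a mask with magnitude 1 and phase 0 leaves the audio unchanged away from the edges, and a magnitude of 0 silences a bin entirely; a trained model would predict values in between for bins where vocals and instruments overlap.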

Copy-paste prompts

Prompt 1
Follow SETUP.md in instr-io/ml to run inference on a .wav file and produce an instrumental output.
Prompt 2
Explain how instr-io/ml groups 2048-point STFT bins into 72 uneven bands and why the vocal range gets more detail.
Prompt 3
Walk me through the U-Net + bidirectional Mamba + global self-attention architecture in instr-io/ml.
Prompt 4
Replace the bidirectional Mamba blocks in instr-io/ml with Transformer blocks and discuss the tradeoffs.
Prompt 5
Sketch how the SI-SDR loss combines with band-wise log magnitude loss during training of instr-io/ml.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.