instr-io/ml

Analysis updated 2026-06-24

★ 0 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((instr-io-ml))
    Inputs
      Stereo audio file
      STFT bands
    Outputs
      Instrumental track
      Magnitude phase mask
    Use Cases
      Karaoke generation
      Music remixing
      Source separation research
    Tech Stack
      Python
      PyTorch
      Mamba
      STFT

What do people build with it?

USE CASE 1

Run inference on a stereo song and get back a vocals-removed instrumental of the same length.

USE CASE 2

Train a custom vocal-removal model on your own dataset using the documented STFT and band layout.

USE CASE 3

Study a U-Net with bidirectional Mamba blocks, a global self-attention bottleneck, and a cross-attention decoder as a reference architecture for source separation.

USE CASE 4

Reuse the magnitude + band-wise log magnitude + SI-SDR loss recipe in other audio masking tasks.
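As a rough illustration of the loss recipe named in use case 4, the sketch below computes the standard SI-SDR score for one channel and folds it into a weighted sum with the two magnitude terms. This is not the repository's code: the function names, the `eps` value, and the weights are hypothetical, and the magnitude and band-wise log magnitude losses are taken as precomputed numbers.

```python
import math


def si_sdr_db(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is cleaner)."""
    # Project the estimate onto the target to find the best-matching scale.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target) + eps
    scale = dot / target_energy
    projection = [scale * t for t in target]
    # Whatever is left after removing the projection counts as distortion.
    noise = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(n * n for n in noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def combined_loss(mag_loss, band_log_mag_loss, si_sdr, weights=(1.0, 1.0, 0.1)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_mag, w_band, w_sdr = weights
    # SI-SDR is a score to maximize, so it enters the loss with a minus sign.
    return w_mag * mag_loss + w_band * band_log_mag_loss - w_sdr * si_sdr


target = [1.0, -2.0, 3.0, 0.5]
noisy = [1.1, -1.9, 3.2, 0.4]
print(si_sdr_db(noisy, target))
print(combined_loss(1.0, 2.0, si_sdr_db(noisy, target)))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property that makes SI-SDR useful when output loudness is not fixed.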

What is it built with?

Python, PyTorch, Mamba, STFT

How does it compare?

	instr-io/ml	0xhassaan/nn-from-scratch	a-little-hoof/dsr
Stars	0	0	0
Language	Python	Python	Python
Setup difficulty	hard	moderate	hard
Complexity	5/5	4/5	5/5
Audience	researcher	developer	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Install, dataset, and training steps live in a separate SETUP.md. Serious training expects a GPU with the Mamba CUDA kernels installed.

License is not stated in the available content.

In plain English

This repository holds the open-source model behind instr.io, a project that strips vocals out of a song so you are left with just the instrumental track. You give it a normal stereo audio file and it gives back a stereo instrumental of the same length. The actual install steps, dataset layout, training commands, and inference instructions live in a separate SETUP.md file.

Under the hood, the audio is first broken into a frequency representation using a 2048-point Short-Time Fourier Transform. That produces around a thousand frequency bins per time slice, which the code then groups into 72 uneven bands. The bands cover the vocal range from roughly 80 Hz to 4 kHz with more detail than the rest of the spectrum, since vocals are the main thing the model is trying to isolate.

The model itself is shaped like a U-Net, a common pattern that shrinks the signal down through several stages and then expands it back up. Each stage uses bidirectional Mamba blocks, a newer kind of sequence layer. At the deepest point the model alternates Mamba blocks with global self-attention, which lets every part of the audio talk to every other part.

On the way back up, the decoder does not look at the entire encoder output directly. Instead, each encoder level is pooled into a small set of summary tokens, around 160 per band in total, and the decoder reads from that compressed memory using cross-attention. Gated skip connections let the decoder choose how much fine detail from the encoder it wants to pull back in.

The final layer predicts a magnitude and a phase value per frequency bin for each stereo channel. Those numbers become a mask that gets multiplied against the original frequency representation, and the result is converted back into a regular audio waveform. Training uses a mix of a magnitude loss, a band-wise log magnitude loss, and an SI-SDR loss that scores how clean the separation sounds.
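The bin and band arithmetic described above can be sketched in plain Python. This is an illustration under assumptions, not the repository's layout: the 44.1 kHz sample rate, the 2/48/22 split of the 72 bands, the even spacing inside each region, and every function name are hypothetical. Treating the predicted magnitude and phase as a polar-form complex mask is likewise an assumption about how the mask is applied.

```python
import cmath

N_FFT = 2048                 # stated in the description
N_BANDS = 72                 # stated in the description
VOCAL_LO_HZ = 80.0           # stated in the description (approximate)
VOCAL_HI_HZ = 4000.0         # stated in the description (approximate)
SAMPLE_RATE = 44_100         # assumption: not stated in the source
LOW_BANDS, VOCAL_BANDS, HIGH_BANDS = 2, 48, 22  # hypothetical split of the 72

n_bins = N_FFT // 2 + 1           # one-sided spectrum: 1025 bins
hz_per_bin = SAMPLE_RATE / N_FFT  # about 21.5 Hz per bin at 44.1 kHz


def hz_to_bin(hz):
    """Nearest STFT bin index for a frequency, clamped to the valid range."""
    return min(n_bins, max(0, round(hz / hz_per_bin)))


def split_evenly(start, stop, parts):
    """Split the bin range [start, stop) into contiguous (lo, hi) chunks."""
    width = stop - start
    edges = [start + (width * i) // parts for i in range(parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


def make_bands():
    """Narrow bands inside the vocal range, wider bands outside it."""
    lo, hi = hz_to_bin(VOCAL_LO_HZ), hz_to_bin(VOCAL_HI_HZ)
    return (split_evenly(0, lo, LOW_BANDS)
            + split_evenly(lo, hi, VOCAL_BANDS)
            + split_evenly(hi, n_bins, HIGH_BANDS))


def apply_mask(mix_bin, mask_mag, mask_phase):
    """Scale and rotate one complex STFT bin by a magnitude/phase mask."""
    return mix_bin * cmath.rect(mask_mag, mask_phase)


bands = make_bands()
print(n_bins, len(bands))
print(bands[2], bands[-1])               # a vocal-range band vs. the top band
print(apply_mask(1 + 1j, 0.5, 0.0))
```

With these assumed numbers a vocal-range band spans three or four bins while a band above 4 kHz spans close to forty, which is one way to realise "more detail in the vocal range" with a fixed budget of 72 bands.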

Copy-paste prompts

Prompt 1

Follow SETUP.md in instr-io/ml to run inference on a .wav file and produce an instrumental output.

Prompt 2

Explain how instr-io/ml groups 2048-point STFT bins into 72 uneven bands and why the vocal range gets more detail.

Prompt 3

Walk me through the U-Net + bidirectional Mamba + global self-attention architecture in instr-io/ml.

Prompt 4

Replace the bidirectional Mamba blocks in instr-io/ml with Transformer blocks and discuss the tradeoffs.

Prompt 5

Sketch how the SI-SDR loss combines with band-wise log magnitude loss during training of instr-io/ml.

Frequently asked questions

What is ml?

Open-source vocal-removal model behind instr.io. A U-Net with bidirectional Mamba blocks, a global self-attention bottleneck, and a cross-attention decoder masks an STFT to produce a clean stereo instrumental.

What language is ml written in?

Mainly Python. The stack also includes PyTorch and Mamba.

What license does ml use?

License is not stated in the available content.

How hard is ml to set up?

Setup difficulty is rated hard, with roughly 1 day or more to a first successful run.

Who is ml for?

Mainly researchers.

Verify against the repo before relying on details.