explaingit

kizuna-intelligence/irodori-tts-lite

53 | Python | Audience · researcher | Complexity · 4/5 | License | Setup · hard

TLDR

Inference runtime that runs the Irodori Japanese DiT TTS model with GPTQ 4-bit quantization, shrinking the checkpoint from 1.88 GB to 279 MB on disk and bringing the model's peak GPU memory to about 552 MB, with audio quality reported as preserved.

Mindmap

mindmap
  root((Irodori TTS Lite))
    Inputs
      Japanese text
      Speaker reference
    Outputs
      Synthesized speech
      Memory benchmarks
    Use Cases
      Low VRAM TTS
      Edge GPU inference
      Quantization study
    Tech Stack
      Python
      PyTorch
      Triton
      GPTQ
      safetensors

Things people build with this

USE CASE 1

Run Irodori Japanese TTS on a small consumer GPU using roughly 1 GB of VRAM

USE CASE 2

Patch an existing Irodori-TTS install to load 4-bit quantized weights

USE CASE 3

Benchmark int4 vs fp32 latency and CER on a Blackwell-class GPU

USE CASE 4

Study GPTQ calibration and Triton fused kernels for DiT linear layers

Tech stack

Python, PyTorch, Triton, CUDA, safetensors

Getting it running

Difficulty · hard | Time to first run · 1h+

Needs a recent NVIDIA GPU with Triton support, plus pyopenjtalk for Japanese G2P (grapheme-to-phoneme conversion).

MIT license: free for any use, including commercial, as long as you keep the copyright notice.

In plain English

Irodori-TTS-Lite is a small inference runtime that runs a Japanese text-to-speech model with 4-bit quantization. The base model is Irodori-TTS, a DiT-style speech synthesizer, and the project's goal is to shrink it to fit on a much smaller GPU without losing audio quality. The original 32-bit checkpoint is 1.88 GB on disk, the int4 version published here is 279 MB, and the model alone needs only about 552 MB of GPU memory at peak. An option called --codec-int4 also pushes the DACVAE audio codec into 4-bit, so the whole end-to-end pipeline (the DiT, the codec, and the tokenizer) fits in roughly 1 GB of VRAM.

The README includes detailed benchmark tables for a Blackwell-generation RTX PRO 4000, showing latency for full-precision and 4-bit modes side by side. Audio quality is reported as preserved, with a character error rate (CER) of 0 percent and speaker similarity scores very close to the FP32 baseline.

The package is self-contained: at runtime you only need PyTorch, Triton, and safetensors. The linear layers in the DiT blocks use a fused Triton kernel for GPTQ-packed 4-bit weights, while smaller pieces such as the AdaLN projections and the encoder stay in fp16, because the launch overhead of many tiny GPU kernels would otherwise wipe out the int4 speed gains. Quantized weights are downloaded automatically from Hugging Face the first time you run inference.

To use it, you pip install from the GitHub URL, install pyopenjtalk for the example script, then either run run_tts.py or import the library and call patch(). Calling patch() swaps in a 4-bit-aware checkpoint loader so the existing irodori_tts code keeps working. The README also describes a separate path for Irodori-TTS-500M-v3, including how to graft the v3 duration predictor onto v2 models that lack one.
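To make "GPTQ-packed 4-bit weights" more concrete, here is a minimal sketch of the storage idea: eight 4-bit codes share one 32-bit word, and each code is mapped back to a float with a scale and a zero point. The helper names (pack_int4, unpack_int4, dequantize), the lowest-nibble-first order, and the (q - zero) * scale convention are assumptions chosen for illustration. They follow a common GPTQ-style layout and are not taken from this repo's Triton kernel.

```python
from typing import List


def pack_int4(codes: List[int]) -> List[int]:
    """Pack 4-bit codes (0..15) into 32-bit words, eight per word, lowest nibble first."""
    if len(codes) % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, q in enumerate(codes[i:i + 8]):
            if not 0 <= q <= 15:
                raise ValueError("each code must fit in 4 bits")
            word |= q << (4 * j)  # place code j in bits [4j, 4j + 3]
        words.append(word)
    return words


def unpack_int4(words: List[int]) -> List[int]:
    """Recover the eight 4-bit codes stored in each 32-bit word."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]


def dequantize(codes: List[int], scale: float, zero: int) -> List[float]:
    """Map integer codes back to floats (assumed convention: (q - zero) * scale)."""
    return [(q - zero) * scale for q in codes]


codes = [0, 1, 2, 3, 12, 13, 14, 15]
packed = pack_int4(codes)            # one 32-bit word holds all eight codes
restored = unpack_int4(packed)       # identical to `codes`
weights = dequantize(restored, scale=0.5, zero=8)
```

Eight weights occupy 4 bytes here instead of 32 bytes in fp32, before counting the scales and zero points. The on-disk figures quoted above (1.88 GB to 279 MB) work out to roughly a 7x reduction rather than the ideal 8x, which is consistent with that metadata and with some layers staying in fp16.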
A second section walks through measure_peak_memory.py for verifying VRAM use yourself and discusses why GPTQ with real calibration data was necessary: random Gaussian calibration produced an unusable CER of about 33 percent. The repo is MIT licensed and links to architecture notes at docs/architecture.md for the deeper design rationale.
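For readers unfamiliar with the metric, character error rate is usually computed as the character-level edit distance between a transcript of the synthesized audio and the reference text, divided by the reference length. The sketch below shows that standard definition only. The function names and sample sentences are hypothetical, and the repo's own evaluation (including which ASR model produces the transcript) may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))  # distance from empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "こんにちは世界"
perfect = cer(reference, "こんにちは世界")   # transcript matches exactly
degraded = cer(reference, "こんばんは世界")  # two characters substituted
```

Under this definition, a CER of 0 percent means every transcribed character matched the reference, while a CER around 33 percent means about one character in three was wrong.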

Copy-paste prompts

Prompt 1
Set up Irodori-TTS-Lite on a 6GB GPU and produce a wav file from a sample Japanese sentence
Prompt 2
Walk me through the fused Triton kernel for GPTQ-packed 4-bit weights and explain why AdaLN stays fp16
Prompt 3
Use measure_peak_memory.py to compare VRAM between --codec-int4 on and off and save a markdown table
Prompt 4
Port the patch() approach in Irodori-TTS-Lite to a different DiT-based TTS checkpoint
Prompt 5
Show me how to graft the v3 duration predictor onto a v2 Irodori model end to end

Verify against the repo before relying on details.