explaingit

kizuna-intelligence/irodori-tts-lite

53 · Python

TLDR

Irodori-TTS-Lite is a small inference runtime that runs a Japanese text-to-speech model using 4-bit quantization.

In plain English

Irodori-TTS-Lite is a small inference runtime that runs a Japanese text-to-speech model using 4-bit quantization. The base model is Irodori-TTS, a DiT-style speech synthesizer, and this project's goal is to shrink it so it fits on a much smaller GPU without losing audio quality. The original 32-bit checkpoint is 1.88 GB on disk; the int4 version published here is 279 MB, and the model alone needs only about 552 MB of GPU memory at peak. An optional --codec-int4 setting also quantizes the DACVAE audio codec to 4-bit, so the whole end-to-end pipeline (the DiT, the codec, and the tokenizer) fits in roughly 1 GB of VRAM.
The README includes detailed benchmark tables for a Blackwell-generation RTX PRO 4000, showing latency for full-precision and 4-bit modes side by side. Audio quality is reported as preserved, with character error rate (CER) at 0 percent and speaker similarity scores very close to the FP32 baseline.
The package is self-contained: at runtime you only need PyTorch, Triton, and safetensors. The DiT block's linear layers use a fused Triton kernel for GPTQ-packed 4-bit weights, while smaller pieces like the AdaLN projections and the encoder are kept in FP16 because the launch overhead of many tiny GPU kernels would otherwise wipe out the int4 speed gains. Quantized weights are downloaded automatically from Hugging Face the first time you run inference.
To use it, pip install from the GitHub URL, install pyopenjtalk for the example script, then either run run_tts.py or import the library and call patch(). Calling patch() swaps in a 4-bit-aware checkpoint loader so the existing irodori_tts code keeps working. The README also describes a separate path for Irodori-TTS-500M-v3, including how to graft the v3 duration predictor onto v2 models that lack one.
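To make the idea of 4-bit packed weights more concrete, here is a minimal pure-Python sketch of group-wise quantization with two 4-bit codes stored per byte. It is an illustration under stated assumptions, not the repo's method: it uses plain round-to-nearest rather than GPTQ's calibrated, error-compensating rounding, and the byte layout and all function names (`quantize_group`, `pack_nibbles`, and so on) are hypothetical and unrelated to the fused Triton kernel or the published checkpoint format.

```python
# Illustrative only: round-to-nearest 4-bit quantization of one weight group.
# GPTQ chooses codes using calibration data; this sketch does not.

def quantize_group(values):
    """Map floats to unsigned 4-bit codes (0..15) with one scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first (assumed layout)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_nibbles(packed):
    """Recover the 4-bit codes from the packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)  # low nibble
        out.append(b >> 4)    # high nibble
    return out

def dequantize_group(codes, scale, lo):
    """Approximate the original floats from codes, scale, and offset."""
    return [lo + c * scale for c in codes]

weights = [-0.75, -0.25, 0.0, 0.25, 0.5, 0.75, -0.5, 0.1]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize_group(unpack_nibbles(packed), scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(weights), "weights stored in", len(packed), "bytes; max error", max_err)
```

Each weight costs half a byte plus a share of the per-group scale and offset, which is roughly where the on-disk savings of an int4 checkpoint come from. A fused kernel would do the unpack and dequantize steps inside the matrix multiply instead of materializing the restored weights.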
A second section walks through measure_peak_memory.py for verifying VRAM use yourself, and discusses why GPTQ with real calibration data was necessary; random Gaussian calibration produced an unusable CER of about 33 percent. The repo is MIT licensed and links to architecture notes under docs/architecture.md for the deeper design rationale.
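For readers unfamiliar with the CER figures quoted above, the following is a minimal sketch of the standard character error rate definition: the character-level edit distance between a reference text and a transcription, divided by the reference length. How the README obtains its transcriptions (which speech recognizer, what text normalization) is not stated here, so the function names and sample strings below are purely illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)

reference = "こんにちは世界"
perfect = cer(reference, reference)        # identical transcription
one_wrong = cer(reference, "こんにちわ世界")  # one substituted character
print(perfect, one_wrong)
```

Under this definition a CER of 0 percent means the transcription matches the reference exactly, while a CER near 33 percent means about one character in three is wrong, missing, or spurious.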

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.