bovod-sjtu/holitok

★ 17 Python Audience · researcher Complexity · 4/5 Setup · hard

Mindmap

mindmap
  root((holitok))
    What it does
      Encode audio to latents
      Extract speech features
      Reconstruct audio
    Models
      HoliTok-Base
      HoliTok-Unite
      Auto-download weights
    Tech stack
      Python
      PyTorch
      CUDA
      Hugging Face
    Use cases
      Train speech models
      Build classifiers
      Research compression

Things people build with this

USE CASE 1

Compress audio files into compact latent representations to use as training data for a speech generation model.

USE CASE 2

Extract 1536-dimensional semantic speech features from a recording to feed into a text classifier or language model.

USE CASE 3

Reconstruct audio from stored latents to measure the quality of a compressed speech representation.
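One rough way to put a number on reconstruction quality is a signal-to-noise ratio between the original and reconstructed samples. The sketch below is a generic metric in plain Python, not part of HoliTok; the function name `snr_db` and the toy signals are illustrative assumptions. It also assumes both signals are already decoded to equal-length, sample-aligned float sequences, which a VAE reconstruction may not guarantee, so perceptual or spectral metrics are often a better fit in practice.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, estimate))
    if noise_power == 0:
        return math.inf  # identical signals
    if signal_power == 0:
        return -math.inf  # silent reference, non-silent estimate
    return 10.0 * math.log10(signal_power / noise_power)


# Toy stand-ins for real audio: 10 ms of a 440 Hz tone at 48 kHz,
# and a "reconstruction" that is the same tone at 90% amplitude.
original = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(480)]
reconstructed = [0.9 * x for x in original]

print(f"SNR: {snr_db(original, reconstructed):.2f} dB")
```

A 10% amplitude error leaves noise power at 1% of signal power, so the toy example comes out at about 20 dB.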

Tech stack

Python, PyTorch, CUDA, Hugging Face

Getting it running

Difficulty · hard Time to first run · 1h+

Requires Python 3.10+, PyTorch 2.8 with CUDA, and a CUDA-capable GPU. CPU inference works but is very slow for long audio files.
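Before installing, it can help to confirm that the interpreter and GPU stack match these requirements. The following is a minimal sketch that uses only the standard library plus PyTorch if it happens to be installed; the helper name `check_environment` is an illustrative assumption and not something HoliTok provides.

```python
import sys


def check_environment(min_python=(3, 10)):
    """Report whether the interpreter and (optional) PyTorch install look usable."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_version": None,
        "cuda_available": False,
    }
    try:
        import torch  # optional: the report stays at its defaults if missing
    except ImportError:
        return report
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

If `cuda_available` is false, the library should still run on CPU according to the note above, only much more slowly on long files.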

In plain English

HoliTok is a research library for converting audio into compact numerical representations and back again. It is designed for speech processing tasks: given an audio file, it encodes the audio into a compressed format called latents, and can reconstruct audio from those latents or extract higher-level features that capture the meaning of what was said.

The system uses a VAE (variational autoencoder), a type of model that learns to compress data into a smaller representation space. HoliTok operates on 48 kHz audio, a higher sample rate than speech models typically use. Two pre-trained model variants are available: HoliTok-Base and HoliTok-Unite. Both download their weights automatically from Hugging Face on first use.

The library has three main operations. Encoding converts a .wav file into a latents file. Semantic feature extraction takes those latents and produces a 1536-dimensional feature vector per time step, intended to capture the content of speech rather than its acoustic details. Reconstruction takes the latents and produces a new .wav file. All three operations are available as Python API calls, command-line commands, or environment-variable-driven shell scripts for batch jobs.

Practical uses include training speech generation models (where you work with compressed audio representations rather than raw waveforms), building speech understanding systems (where the semantic features serve as input to a classifier or language model), and researching audio compression and reconstruction quality.

The library requires Python 3.10 or newer and PyTorch 2.8 with CUDA. It is published alongside a research paper on arXiv from a team at Shanghai Jiao Tong University, covering the dual capabilities of the tokenization approach for both generating and understanding speech.
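Because the semantic features come as one 1536-dimensional vector per time step, a classifier that expects a fixed-size input needs some form of pooling over time. The sketch below shows plain mean pooling in pure Python. It does not call HoliTok: the `frames` data is a synthetic stand-in, `mean_pool` is a hypothetical helper, and the assumed (time, 1536) layout should be checked against what the library actually returns.

```python
FEATURE_DIM = 1536  # per-time-step feature size stated in the description


def mean_pool(frames):
    """Average a sequence of per-time-step vectors into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one time step")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all time steps must have the same dimension")
    count = len(frames)
    return [sum(frame[i] for frame in frames) / count for i in range(dim)]


# Synthetic stand-in for extracted features: 4 time steps, each a constant vector.
frames = [[float(t)] * FEATURE_DIM for t in range(4)]
pooled = mean_pool(frames)

print(len(pooled), pooled[0])  # 1536 1.5
```

Mean pooling discards temporal order, which is usually acceptable for utterance-level classification. A sequence model would consume the per-step vectors directly instead.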

Copy-paste prompts

Prompt 1

Use HoliTok to encode a batch of .wav files to latents and then reconstruct them. Show me how to measure reconstruction quality compared to the originals.

Prompt 2

I want to train a speech generation model using HoliTok latents. Walk me through encoding a dataset folder and what shape the latent tensors will be.

Prompt 3

Extract semantic features from a .wav file using HoliTok-Unite and show me how to use the 1536-dimensional output as input to a PyTorch classifier.

Prompt 4

What is the difference between HoliTok-Base and HoliTok-Unite, and which should I use for a speech understanding task versus a speech synthesis task?

Verify against the repo before relying on details.