explaingit

gauravvij/kokoro-tts-vs-supertonic-3-tts

3 stars | Python | Audience · researcher | Complexity · 3/5 | Active | Setup · moderate

TLDR

CPU-only head-to-head benchmark of Kokoro 82M and Supertonic 3 text-to-speech models, with scripts that produce a Markdown report, charts, and WAV samples.

Mindmap

mindmap
  root((tts-benchmark))
    Inputs
      Text prompts
      Kokoro and Supertonic models
      CPU runtime
    Outputs
      CSV timings
      Markdown report
      Matplotlib charts
      WAV samples
    Use Cases
      Compare TTS speed and quality
      Pick a CPU friendly TTS model
      Reproduce a benchmark report
    Tech Stack
      Python
      PyTorch
      ONNX Runtime
      espeak-ng
      matplotlib

Things people build with this

USE CASE 1

Pick a CPU-friendly TTS model by comparing speed and audio quality

USE CASE 2

Reproduce a 120-run TTS benchmark on a CPU with espeak-ng

USE CASE 3

Generate a Markdown report and matplotlib charts from TTS timing data

USE CASE 4

Listen to side-by-side WAV samples of Kokoro and Supertonic across text lengths

Tech stack

Python, PyTorch, ONNX Runtime, espeak-ng, matplotlib, pandas

Getting it running

Difficulty · moderate | Time to first run · 30 min

Needs espeak-ng installed at the OS level plus a Python venv with PyTorch and ONNX Runtime, and you must download Kokoro ONNX model files from Hugging Face before running.
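Before starting a run, it can help to confirm that the prerequisites listed above are actually present. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the import names (for example `kokoro_onnx` for the `kokoro-onnx` package) are assumptions inferred from the package names, so adjust them if a package exposes a different module name.

```python
import importlib.util
import shutil

# Import names are assumptions inferred from the package names in the README;
# a package may expose a different module name than the one it installs under.
REQUIRED_MODULES = [
    "torch", "onnxruntime", "kokoro", "kokoro_onnx", "supertonic",
    "soundfile", "matplotlib", "pandas", "numpy",
]


def check_environment(modules=REQUIRED_MODULES, binary="espeak-ng"):
    """Return (missing_modules, binary_found) without importing the packages."""
    # find_spec only locates a top-level module; it does not execute it.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # shutil.which returns None when the executable is not on PATH.
    return missing, shutil.which(binary) is not None


if __name__ == "__main__":
    missing, has_espeak = check_environment()
    print("espeak-ng on PATH:", has_espeak)
    print("missing Python modules:", ", ".join(missing) or "none")
```

This check does not cover the Kokoro ONNX model files, which still have to be downloaded from Hugging Face separately.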

In plain English

This repository is a head-to-head benchmark of two text-to-speech models, Kokoro 82M and Supertonic 3, both running on a regular CPU with no GPU. Text to speech, or TTS, is software that turns written text into spoken audio. The point of the comparison is to see which model gives better trade-offs between how fast it generates audio and how natural that audio sounds.

The README states up front that the benchmark itself was designed, written, and executed end to end by an autonomous coding agent called Neo from a single prompt, with no manual coding or configuration. The benchmark was run on an AMD EPYC 7763 with 4 cores and 15.6 GB of RAM using Python 3.11.

The results are summarized in a small table. Supertonic-3 in 2-step mode is the fastest at about 6.1 times real-time speed, but the audio quality is described as poor and robotic. Supertonic-3 in 5-step mode runs at 3.2 times real-time with audio quality described as good and clear. Kokoro 82M in both its PyTorch and ONNX forms runs at about 2 times real-time but has excellent, human-like quality. The author calls Supertonic 2-step the speed winner, 5-step the balance pick, and Kokoro the quality winner.

The repo contains a benchmark.py script that runs 120 timed measurements, a report.py script that turns the raw numbers into a Markdown report and matplotlib charts, and a results folder with the CSV of raw timings, the rendered report, two charts comparing real-time factor and latency against text length, and 24 generated WAV audio samples covering each configuration and text length combination. A separate blog_post.md writes up the findings in more depth.
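The counts above fit a simple measurement grid, which the sketch below illustrates. It rests on assumptions rather than on the repository's code: 24 WAV samples across the 4 named configurations would imply 6 text lengths, and 120 timed runs would then be 5 repeats per combination, but the summary does not state that breakdown. "Times real-time" is taken here to mean seconds of audio produced per second of compute. The configuration labels, `fake_synthesize`, and `run_grid` are hypothetical names, and the stub returns made-up durations instead of calling a TTS model.

```python
import itertools
import time

# Assumed grid: 4 configurations x 6 text lengths x 5 repeats = 120 runs.
CONFIGS = ["supertonic-2step", "supertonic-5step", "kokoro-pytorch", "kokoro-onnx"]
TEXT_LENGTHS = [1, 2, 3, 4, 5, 6]  # placeholder labels; the real lengths are not given
REPEATS = 5


def speed_factor(audio_seconds, synth_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    if synth_seconds <= 0:
        raise ValueError("synth_seconds must be positive")
    return audio_seconds / synth_seconds


def fake_synthesize(config, text_length):
    """Hypothetical stand-in for a TTS call; returns an audio duration in seconds."""
    return 2.0 * text_length


def run_grid(synthesize=fake_synthesize, clock=time.perf_counter):
    """Time every (config, text length, repeat) combination and collect raw rows."""
    rows = []
    for config, length, repeat in itertools.product(CONFIGS, TEXT_LENGTHS, range(REPEATS)):
        start = clock()
        audio_seconds = synthesize(config, length)
        elapsed = clock() - start
        rows.append({
            "config": config,
            "text_length": length,
            "repeat": repeat,
            "audio_seconds": audio_seconds,
            "synth_seconds": elapsed,
        })
    return rows
```

Swapping `fake_synthesize` for a function that calls a real model and returns the duration of the generated audio would turn the rows into usable timings. Under this definition, a run that produces 10 seconds of audio in 2 seconds has a speed factor of 5.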
To reproduce the benchmark yourself, the README walks through installing the espeak-ng system dependency, creating a Python virtual environment, installing the supertonic, kokoro, kokoro-onnx, and onnxruntime packages along with soundfile, matplotlib, pandas, numpy, and torch, then downloading the Kokoro ONNX model files from Hugging Face before running benchmark.py and report.py. The repository has 3 stars at the time of writing.
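The reporting step described earlier (raw CSV in, Markdown out) can be sketched with the standard library alone. This is an illustration, not the repository's report.py, which per the README relies on pandas and matplotlib. The column names `config`, `audio_seconds`, and `synth_seconds` and the function names are hypothetical, and the rows under the main guard are invented sample data, not benchmark results.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(csv_text):
    """Group rows by configuration and return {config: mean speed factor}."""
    by_config = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        audio = float(row["audio_seconds"])
        synth = float(row["synth_seconds"])
        # Speed factor: seconds of audio per second of compute.
        by_config[row["config"]].append(audio / synth)
    return {config: statistics.mean(values) for config, values in by_config.items()}


def to_markdown(summary):
    """Render the summary as a Markdown table, fastest configuration first."""
    lines = ["| Configuration | Mean speed (x real-time) |", "| --- | --- |"]
    for config, speed in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"| {config} | {speed:.1f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented rows for illustration only.
    sample = "config,audio_seconds,synth_seconds\nmodel-a,4.0,2.0\nmodel-b,3.0,3.0\n"
    print(to_markdown(summarize(sample)))
```

Averaging per-run speed factors is one of several reasonable aggregations; dividing total audio seconds by total compute seconds would weight longer texts more heavily.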

Copy-paste prompts

Prompt 1
Walk me through benchmark.py and how it produces 120 timed measurements across configurations
Prompt 2
Help me set up espeak-ng, a Python 3.11 venv, and the kokoro and supertonic packages to rerun this benchmark
Prompt 3
Show me how report.py turns the raw CSV into the Markdown report and the real-time-factor chart
Prompt 4
Help me add a third TTS model into this benchmark on the same CPU
Prompt 5
Explain why Supertonic 5-step is the balance pick and Kokoro is the quality winner in the results

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.