xzf-thu/voices-in-the-wild-bench

Analysis updated 2026-06-24

★ 11 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((voices-in-the-wild-bench))
    Inputs
      Audio clips
      JSONL records
      Model predictions
    Outputs
      CER scores
      WER scores
      Leaderboard entries
    Use Cases
      ASR robustness testing
      Speech model evaluation
      Noise condition analysis
    Tech Stack
      Python
      Whisper
      NeMo
      HuggingFace


What do people build with it?

USE CASE 1

Benchmark a new ASR model against noisy real-world Chinese and English audio

USE CASE 2

Reproduce CER and WER scores for Whisper-Large-v3 and Canary-1b-v2

USE CASE 3

Submit results to the public leaderboard for speech recognition robustness

USE CASE 4

Add a custom model wrapper to evaluate it across eight acoustic conditions

What is it built with?

Python, Whisper, NeMo, Transformers

How does it compare?

	xzf-thu/voices-in-the-wild-bench	2arons/llm-cli	an1x3r/anima-artist-mixer
Stars	11	11	11
Language	Python	Python	Python
Setup difficulty	moderate	easy	easy
Complexity	3/5	2/5	2/5
Audience	researcher	developer	designer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

The full dataset is hosted on Hugging Face, but the eight example audio files in the repo let you smoke-test the pipeline first.

In plain English

Voices-in-the-Wild-Bench is a benchmark dataset and a small evaluation toolkit for speech recognition systems. The goal is to measure how well speech and voice assistant systems hold up when the audio is messy in ways that are common in everyday life, rather than the clean studio recordings that most academic benchmarks use. It covers both Chinese and English.

The benchmark contains 5,000 audio examples. Of these, 3,500 are synthetic speech with controlled perturbations and 1,500 are real recordings from sixteen human speakers. The split between Mandarin Chinese and English is even, with 2,500 samples each. Every clip is tagged with one of eight acoustic conditions: noise, far field, obstructed speech, distortion, recording artifacts, echo, dropout, and a mixed category that combines several conditions in one clip. The repository ships eight short example audio files, one per category, so that you can smoke-test your evaluation pipeline before downloading the full set from Hugging Face.

Each sample is stored as a JSONL record with an index, an audio path, an instruction, a reference answer, a subset label that encodes the source type, language, and acoustic condition, and an empty prediction field that the evaluated model fills in.

The README documents how to score predictions and how to run models. Chinese audio is scored with character error rate and English audio with word error rate. The evaluate_predictions.py script reports an overall score, a language-wise breakdown, and a real versus synthetic breakdown for each acoustic category. There is also a run_inference.py script for running included model wrappers, with the first public wrappers being Whisper-Large-v3 from OpenAI through the Transformers pipeline, Mega-ASR, which is described as the public name for the authors' own merged_v2 model, and Canary-1b-v2 through NVIDIA NeMo. The repository links out to a leaderboard site, a paper, the dataset on Hugging Face, and an issues page for submitting new results.

The release notes at the top of the README show that the initial skeleton went up on 2026-05-16, and reproducible evaluation utilities, example records, and the first two model wrappers were added two days later.
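
To make the scoring description above concrete, here is a minimal sketch of character error rate for Chinese and word error rate for English, computed over JSONL-style records. It is not the repository's evaluate_predictions.py. The field names (`answer`, `prediction`, `subset`), the subset label strings such as `real_en_noise`, and the light text normalization are all assumptions made for illustration, so check the README for the real schema.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tokenize(text, lang):
    # Characters for Chinese (CER), whitespace-separated words for English (WER).
    # Assumption: lowercasing is the only normalization applied here.
    if lang == "zh":
        return [ch for ch in text if not ch.isspace()]
    return text.lower().split()


def corpus_error_rate(records, lang):
    """Total edits divided by total reference length across all records."""
    errors = 0
    ref_len = 0
    for rec in records:
        ref = tokenize(rec["answer"], lang)
        hyp = tokenize(rec["prediction"], lang)
        errors += edit_distance(ref, hyp)
        ref_len += len(ref)
    return errors / ref_len if ref_len else 0.0


# Hypothetical records: keys and subset labels are illustrative only.
lines = [
    '{"index": 0, "answer": "turn on the light", '
    '"prediction": "turn on the lights", "subset": "real_en_noise"}',
    '{"index": 1, "answer": "今天天气很好", '
    '"prediction": "今天天气真好", "subset": "synthetic_zh_echo"}',
]
records = [json.loads(line) for line in lines]

wer = corpus_error_rate([r for r in records if "_en_" in r["subset"]], "en")
cer = corpus_error_rate([r for r in records if "_zh_" in r["subset"]], "zh")
print(f"WER (en): {wer:.4f}")
print(f"CER (zh): {cer:.4f}")
```

The sketch aggregates at corpus level (total edits over total reference length) rather than averaging per-clip rates. Whether the benchmark's own script does the same, and how it normalizes punctuation and casing, is not stated here, so scores from this sketch should not be expected to match the leaderboard.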

Copy-paste prompts

Prompt 1

Show me how to run evaluate_predictions.py on a JSONL file of my model outputs

Prompt 2

Walk me through writing a model wrapper for run_inference.py against this benchmark

Prompt 3

Explain how the eight acoustic categories like far field and dropout are tagged in the dataset

Prompt 4

Compare CER and WER reporting in this benchmark for Chinese versus English samples

Frequently asked questions

What is voices-in-the-wild-bench?

A bilingual Chinese and English benchmark of 5,000 noisy real and synthetic audio clips, with a Python toolkit for scoring ASR models such as Whisper-Large-v3 and Canary-1b-v2.

What language is voices-in-the-wild-bench written in?

Mainly Python. The stack also includes Whisper, NeMo, and Hugging Face Transformers.

How hard is voices-in-the-wild-bench to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is voices-in-the-wild-bench for?

Mainly researchers.


Verify against the repo before relying on details.