explaingit

xzf-thu/voices-in-the-wild-bench

13 Python Audience · researcher Complexity · 3/5 Active Setup · moderate

TLDR

A bilingual Chinese and English benchmark of 5,000 noisy real and synthetic audio clips, with a Python toolkit for scoring ASR models such as Whisper and Canary.

Mindmap

mindmap
  root((voices-in-the-wild-bench))
    Inputs
      Audio clips
      JSONL records
      Model predictions
    Outputs
      CER scores
      WER scores
      Leaderboard entries
    Use Cases
      ASR robustness testing
      Speech model evaluation
      Noise condition analysis
    Tech Stack
      Python
      Whisper
      NeMo
      HuggingFace

Things people build with this

USE CASE 1

Benchmark a new ASR model against noisy real-world Chinese and English audio

USE CASE 2

Reproduce CER and WER scores for Whisper-Large-v3 and Canary-1b-v2

USE CASE 3

Submit results to the public leaderboard for speech recognition robustness

USE CASE 4

Add a custom model wrapper to evaluate it across eight acoustic conditions

Tech stack

Python, Whisper, NeMo, Transformers

Getting it running

Difficulty · moderate Time to first run · 30 min

The full dataset is hosted on Hugging Face, but the eight example audio files in the repo let you smoke-test the pipeline first.

In plain English

Voices-in-the-Wild-Bench is a benchmark dataset and a small evaluation toolkit for speech recognition systems. The goal is to measure how well speech and voice assistant systems hold up when the audio is messy in ways that are common in everyday life, in contrast to the clean studio recordings that most academic benchmarks use. It covers both Chinese and English.

The benchmark contains 5,000 audio examples. Of these, 3,500 are synthetic speech with controlled perturbations and 1,500 are real recordings from sixteen human speakers. The split between Mandarin Chinese and English is even, with 2,500 samples each. Every clip is tagged with one of eight acoustic conditions: noise, far field, obstructed speech, distortion, recording artifacts, echo, dropout, and a mixed category that combines several conditions in one clip.

The repository ships eight short example audio files, one per category, so that you can smoke-test your evaluation pipeline before downloading the full set from Hugging Face. Each sample is stored as a JSONL record with an index, an audio path, an instruction, a reference answer, a subset label that encodes the source type, language, and acoustic condition, and an empty prediction field that the evaluated model fills in.

The README documents how to score predictions and how to run models. Chinese audio is scored with character error rate (CER) and English audio with word error rate (WER). The evaluate_predictions.py script reports an overall score, a language-wise breakdown, and a real versus synthetic breakdown for each acoustic category. There is also a run_inference.py script for running the included model wrappers. The first public wrappers are Whisper-Large-v3 from OpenAI through the Transformers pipeline, Mega-ASR, which is described as the public name for the authors' own merged_v2 model, and Canary-1b-v2 through NVIDIA NeMo.

The repository links out to a leaderboard site, a paper, the dataset on Hugging Face, and an issues page for submitting new results. The release notes at the top of the README show that the initial skeleton went up on 2026-05-16, and that reproducible evaluation utilities, example records, and the first two model wrappers were added two days later.
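
To make the CER and WER scoring described above concrete, here is a minimal Python sketch of how an error rate can be computed from JSONL records. It is not the repository's evaluate_predictions.py. The field names `subset`, `answer`, and `prediction`, the `<source>_<language>_<condition>` label format, and the text normalization (lowercasing English, dropping whitespace in Chinese) are all assumptions made for illustration, and the two demo records are invented.

```python
import json
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_tok != hyp_tok)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def tokenize(text, language):
    """Characters for Chinese (CER); lowercased words otherwise (WER)."""
    if language == "zh":
        return [ch for ch in text if not ch.isspace()]
    return text.lower().split()


def score_records(lines):
    """Return {language: error rate}, pooling edits over all clips per language."""
    edits = defaultdict(int)
    ref_len = defaultdict(int)
    for line in lines:
        record = json.loads(line)
        # Hypothetical subset format: "<source>_<language>_<condition>".
        _source, language, _condition = record["subset"].split("_", 2)
        ref = tokenize(record["answer"], language)
        hyp = tokenize(record["prediction"], language)
        edits[language] += edit_distance(ref, hyp)
        ref_len[language] += len(ref)
    return {lang: edits[lang] / ref_len[lang] for lang in edits if ref_len[lang]}


# Made-up records for illustration only; not taken from the dataset.
demo = [
    '{"index": 0, "subset": "real_en_noise", '
    '"answer": "turn on the light", "prediction": "turn on light"}',
    '{"index": 1, "subset": "synthetic_zh_far_field", '
    '"answer": "打开灯", "prediction": "打开门"}',
]
scores = score_records(demo)
print(scores)
```

The sketch pools edits and reference lengths across clips before dividing, which is one common way to report a corpus-level error rate. The toolkit may aggregate or normalize text differently, so check the README before comparing numbers.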

Copy-paste prompts

Prompt 1
Show me how to run evaluate_predictions.py on a JSONL file of my model outputs
Prompt 2
Walk me through writing a model wrapper for run_inference.py against this benchmark
Prompt 3
Explain how the eight acoustic categories like far field and dropout are tagged in the dataset
Prompt 4
Compare CER and WER reporting in this benchmark for Chinese versus English samples

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.