xzf-thu/mega-asr

Analysis updated 2026-06-24

★ 93 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((Mega-ASR))
    Inputs
      Noisy audio recordings
      Far-field microphone clips
    Outputs
      Transcribed text
      Lower WER scores
    Use Cases
      Transcribe field recordings
      Benchmark against Whisper
      Research robust ASR
    Tech Stack
      Python
      PyTorch
      Hugging Face


What do people build with it?

USE CASE 1

Transcribe noisy real-world audio with strong background interference

USE CASE 2

Benchmark in-the-wild speech recognition against Whisper and Qwen3-ASR

USE CASE 3

Fine-tune an ASR foundation model on custom acoustic conditions

USE CASE 4

Research A2S-SFT and DG-WGPO training recipes

What is it built with?

Python, PyTorch, Hugging Face

How does it compare?

	xzf-thu/mega-asr	oft3r/agentic-trading-desk	yoheinakajima/activegraph
Stars	93	90	96
Language	Python	Python	Python
Setup difficulty	hard	moderate	easy
Complexity	4/5	3/5	4/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

This is a foundation model trained on 2.6M samples, so expect to need a GPU, Hugging Face downloads, and a multi-step inference setup.

In plain English

MEGA-ASR is a speech recognition model from a group at Tsinghua University aimed at transcribing audio captured in messy real-world conditions, rather than the clean studio recordings that most speech models are tested on. The README frames it as a foundation model for what the authors call in-the-wild speech recognition, meaning audio with background noise, far-field microphones, obstructions, echoes and reverberation, recording artifacts, electronic distortion, and dropped pieces of transmission.

The training set is described as 2.6 million samples covering 7 atomic acoustic conditions and 54 compound scenarios where those conditions stack on top of each other. The authors report up to roughly 30 percent gains over leading open- and closed-source models on these harder cases. Two training techniques are named in the README: A2S-SFT for supervised fine-tuning, and a reinforcement learning step called DG-WGPO. The README does not explain what those acronyms stand for or how they work in detail, so a non-technical reader will mostly take them as the labels of the recipes used.

Most of the README is a side-by-side comparison table where short audio clips are transcribed by MEGA-ASR and by other systems, including Qwen3-ASR, Gemini-3-Pro, Seed-ASR, and Whisper. Each row shows the ground-truth text, each model's transcription, and a Word Error Rate score. In the examples shown, MEGA-ASR produces lower error rates on the hard clips while the other systems often return empty output, hallucinate unrelated text, or drop large portions of the sentence.

The project links out to a technical report on arXiv, the Voices-in-the-Wild-2M training dataset on Hugging Face, the model weights on Hugging Face, a separate benchmark repository called Voices-in-the-Wild-Bench, and a project page. The README in this repository is mostly the marketing-style introduction and the comparison samples.
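For readers unfamiliar with the Word Error Rate score that appears in the comparison rows, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a model's transcription and the ground-truth text, divided by the number of ground-truth words. The function name `word_error_rate` and the normalization used here (lowercasing and splitting on whitespace) are assumptions for illustration. The README does not say how the authors normalize text before scoring, so numbers from this sketch may not match theirs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: text is lowercased and split on whitespace only.
    Real benchmarks usually also strip punctuation and normalize numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the cat sat on the mat"
    print(word_error_rate(truth, "the cat sat on the mat"))
    print(word_error_rate(truth, "the cat sat on mat"))
    print(word_error_rate(truth, ""))
```

Two properties of this definition explain the failure modes described above. An empty output scores exactly 1.0, because every reference word counts as a deletion. A hallucinated transcription can score above 1.0, because inserted words are counted as errors but the denominator stays fixed at the reference length.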

Copy-paste prompts

Prompt 1

Show me how to load the Mega-ASR weights from Hugging Face and transcribe a noisy wav file

Prompt 2

Compare Mega-ASR and Whisper on three hard clips and compute Word Error Rate for each

Prompt 3

Summarize what the A2S-SFT and DG-WGPO training steps in Mega-ASR are doing at a high level

Prompt 4

Build a small benchmark script around Voices-in-the-Wild-Bench to evaluate my own ASR model

Frequently asked questions

What is mega-asr?

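A speech recognition foundation model from a Tsinghua University group, tuned for noisy, far-field, in-the-wild audio. The authors report gains of up to roughly 30 percent over leading open- and closed-source models on hard cases, with Whisper and Qwen3-ASR among the systems compared.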

What language is mega-asr written in?

Mainly Python. The stack also includes PyTorch and Hugging Face.

How hard is mega-asr to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is mega-asr for?

Mainly researchers.


Verify against the repo before relying on details.