MEGA-ASR is a speech recognition model from a group at Tsinghua University. It is aimed at transcribing audio captured in messy real-world conditions rather than the clean studio recordings that most speech models are tested on. The README frames it as a foundation model for what the authors call in-the-wild speech recognition: audio with background noise, far-field microphones, obstructions, echoes and reverberation, recording artifacts, electronic distortion, and transmission dropouts.

The training set is described as 2.6 million samples covering 7 atomic acoustic conditions and 54 compound scenarios in which those conditions stack on top of each other. The authors report gains of up to roughly 30 percent over leading open-source and closed-source models on these harder cases. The README names two training techniques: A2S-SFT for supervised fine-tuning, and a reinforcement learning step called DG-WGPO. It does not say what those acronyms stand for or explain in detail how they work, so a non-technical reader can treat them simply as labels for the training recipes.

Most of the README is a side-by-side comparison table in which short audio clips are transcribed by MEGA-ASR and by other systems, including Qwen3-ASR, Gemini-3-Pro, Seed-ASR, and Whisper. Each row shows the ground-truth text, each model's transcription, and a Word Error Rate score. In the examples shown, MEGA-ASR produces lower error rates on the hard clips, while the other systems often return empty output, hallucinate unrelated text, or drop large portions of the sentence.

The project links out to a technical report on arXiv, the Voices-in-the-Wild-2M training dataset on Hugging Face, the model weights on Hugging Face, a separate benchmark repository called Voices-in-the-Wild-Bench, and a project page. The README in this repository consists mostly of a marketing-style introduction and the comparison samples.
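
For readers unfamiliar with the Word Error Rate score shown in the comparison table, the sketch below illustrates how the metric is conventionally computed: the number of word-level substitutions, deletions, and insertions needed to turn the model's transcription into the ground truth, divided by the number of ground-truth words. This is a minimal illustration, not the project's evaluation code. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are all assumptions, and the README summary does not say which text normalization the authors apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / number of reference words.

    Assumption: text is lowercased and split on whitespace. The project's own
    normalization (punctuation, casing, tokenization) is not described here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (single-row Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only; these are not taken from the README.
    truth = "the cat sat on the mat"
    print(word_error_rate(truth, "the cat sat on the mat"))  # perfect transcript
    print(word_error_rate(truth, ""))                        # empty output
    print(word_error_rate(truth, "the cat sat on mat"))      # one dropped word
```

Under this definition, an empty transcription scores exactly 1.0, because every ground-truth word counts as a deletion. Hallucinated text can score above 1.0, since inserted words add to the error count while the denominator stays fixed at the reference length.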
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.