explaingit

nl8590687/asrt_speechrecognition

8,373 | Python | Audience · researcher | Complexity · 5/5 | License | Setup · hard

TLDR

A research-grade Chinese speech recognition system that converts spoken Mandarin audio into written text using deep learning, runnable as an API server so other apps can send audio and receive transcriptions.

Mindmap

mindmap
  root((repo))
    What it does
      Mandarin speech to text
      Two-stage pipeline
      HTTP and gRPC API
    How it works
      Acoustic model CTC
      Language model
      Phonetic pinyin bridge
    Setup options
      Docker API server
      Train from scratch
      Pre-trained models
    Client SDKs
      Python Go Java
      Windows desktop

Things people build with this

USE CASE 1

Study how a two-stage acoustic-plus-language-model pipeline for Mandarin Chinese speech recognition is built and trained.

USE CASE 2

Run the pre-trained model as an API server and send audio files to it from your own app to get back Chinese text transcriptions.

USE CASE 3

Train the acoustic model from scratch using publicly available Chinese speech datasets totaling over a thousand hours of audio.

USE CASE 4

Build a Mandarin transcription feature into your own project using the provided client SDKs for Python, Go, Java, or Windows.

Tech stack

Python | TensorFlow | Docker | HTTP | gRPC

Getting it running

Difficulty · hard | Time to first run · 1h+

Training requires an NVIDIA GPU with at least 11 GB of VRAM. Running the pre-trained model via Docker is easier but still needs a suitable GPU environment.

GPL v3.0: you can use and modify it freely, but any software you distribute that incorporates it must also be released under the GPL.

In plain English

ASRT is a Chinese speech recognition system built with deep learning techniques using Python and TensorFlow. It listens to a short audio clip of spoken Mandarin Chinese and converts it to written text. The project is research-focused and intended for people who want to study or build on Chinese automatic speech recognition, not for plug-in commercial use.

The system works in two stages. The first stage is an acoustic model that takes an audio file and produces a sequence of Chinese phonetic symbols (called pinyin). This model uses deep convolutional neural networks combined with a technique called CTC, which handles the fact that audio and text do not line up character by character. The second stage is a language model that takes the phonetic sequence and converts it to actual Chinese characters, using a statistical probability approach.

To train the system from scratch you need a reasonably powerful machine: at least a 4-core CPU, 16 GB of RAM, and an NVIDIA GPU with 11 GB or more of graphics memory. Training uses publicly available Chinese speech datasets, and the project lists six datasets totaling over a thousand hours of audio. The best-performing version of the acoustic model achieves around 85% accuracy on phonetic recognition when tested against held-out data.

Once trained, the system can be run as an API server over HTTP or gRPC, so other software can send audio data and receive transcribed text. The project provides separate client SDKs for Windows, Python, Go, and Java to make calling the server straightforward. If you do not want to train anything yourself, pre-trained model files are included in the downloadable release packages. A Docker image is available for running the API server without manual setup, though training still requires a suitable GPU environment.

The project is licensed under GPL v3.0.
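To make the two-stage idea concrete, here is a minimal, self-contained sketch in plain Python. It is not taken from the ASRT code base. The first function shows greedy CTC decoding (pick the best label per frame, merge repeats, drop the blank), and the second shows one common statistical approach to turning pinyin into characters, a bigram Viterbi search. The label set, the blank symbol `_`, the candidate lexicon, and every probability below are invented for illustration, and the function names `ctc_greedy_decode` and `pinyin_to_text` are hypothetical.

```python
import math

BLANK = "_"  # CTC blank symbol (label choice is an assumption)


def ctc_greedy_decode(frame_probs, labels):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best = [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    decoded, prev = [], None
    for sym in best:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two equal symbols keeps both.
        if sym != prev and sym != BLANK:
            decoded.append(sym)
        prev = sym
    return decoded


# Toy lexicon and probabilities, invented for illustration only.
CANDIDATES = {"ni3": ["你", "拟"], "hao3": ["好", "郝"]}
UNIGRAM = {"你": 0.6, "拟": 0.1, "好": 0.5, "郝": 0.05}
BIGRAM = {("你", "好"): 0.8, ("拟", "好"): 0.05}
FLOOR = 1e-6  # probability assigned to unseen events


def pinyin_to_text(pinyin_seq):
    """Viterbi search for the most probable character sequence."""
    if not pinyin_seq:
        return ""
    # paths maps last character -> (log probability, text so far)
    paths = {
        c: (math.log(UNIGRAM.get(c, FLOOR)), c)
        for c in CANDIDATES[pinyin_seq[0]]
    }
    for syllable in pinyin_seq[1:]:
        new_paths = {}
        for cur in CANDIDATES[syllable]:
            # Keep only the best-scoring way of reaching each candidate.
            new_paths[cur] = max(
                (score + math.log(BIGRAM.get((prev, cur), FLOOR)), text + cur)
                for prev, (score, text) in paths.items()
            )
        paths = new_paths
    return max(paths.values())[1]


# Per-frame probabilities over LABELS, as an acoustic model might emit them.
LABELS = [BLANK, "ni3", "hao3"]
FRAMES = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
    [0.6, 0.2, 0.2],
]

if __name__ == "__main__":
    pinyin = ctc_greedy_decode(FRAMES, LABELS)
    print(pinyin, pinyin_to_text(pinyin))
```

Seven audio frames collapse to two pinyin syllables here, which is the alignment problem CTC addresses. A real system would likely use beam search and a much larger lexicon and n-gram table, so check the repo for what ASRT actually does.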

Copy-paste prompts

Prompt 1
I want to run ASRT Speech Recognition as an API server using Docker without training anything. Give me the exact Docker command to start it and a Python example that sends a WAV file and prints the transcription.
Prompt 2
Walk me through the two stages of the ASRT pipeline: what does the acoustic model output, and how does the language model turn that into Chinese characters?
Prompt 3
I want to train the ASRT acoustic model on one of the listed Chinese speech datasets. What are the hardware requirements, which dataset should I start with, and what command kicks off training?
Prompt 4
Show me how to call the ASRT gRPC API from Python: import, set up the channel, send an audio file, and print the returned text.

Verify against the repo before relying on details.