nl8590687/asrt_speechrecognition

★ 8,373 | Python | Audience · researcher | Complexity · 5/5 | License · GPL v3.0 | Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Mandarin speech to text
      Two-stage pipeline
      HTTP and gRPC API
    How it works
      Acoustic model CTC
      Language model
      Phonetic pinyin bridge
    Setup options
      Docker API server
      Train from scratch
      Pre-trained models
    Client SDKs
      Python Go Java
      Windows desktop


Things people build with this

USE CASE 1

Study how a two-stage acoustic-plus-language-model pipeline for Mandarin Chinese speech recognition is built and trained.

USE CASE 2

Run the pre-trained model as an API server and send audio files to it from your own app to get back Chinese text transcriptions.

USE CASE 3

Train the acoustic model from scratch using publicly available Chinese speech datasets totaling over a thousand hours of audio.

USE CASE 4

Build a Mandarin transcription feature into your own project using the provided client SDKs for Python, Go, Java, or Windows.
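For the API-server use case, the client side mostly amounts to packaging audio and posting it. The sketch below reads a WAV file, base64-encodes its raw PCM samples, and posts them as JSON. The JSON field names (`channels`, `sample_rate`, `byte_width`, `samples`) and the helper names `build_payload` and `transcribe` are assumptions made for illustration, not the project's confirmed API; check the repo's API documentation or the official client SDKs for the real request format and server URL.

```python
import base64
import io
import json
import urllib.request
import wave


def build_payload(wav_bytes):
    """Read WAV bytes and package the raw PCM plus its format as a JSON-ready dict."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
        return {
            # Field names are assumptions for this sketch; verify against the repo.
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "byte_width": wav.getsampwidth(),
            "samples": base64.b64encode(pcm).decode("ascii"),
        }


def transcribe(wav_path, url):
    """POST the payload to `url` (supplied by the caller) and return the parsed JSON reply."""
    with open(wav_path, "rb") as f:
        payload = build_payload(f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping `build_payload` separate from the network call makes the packaging step easy to check without a running server.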

Tech stack

Python · TensorFlow · Docker · HTTP · gRPC

Getting it running

Difficulty · hard | Time to first run · 1h+

Training requires an NVIDIA GPU with at least 11 GB of VRAM. Running the pre-trained model via Docker is easier but still needs a suitable GPU environment.

GPL v3.0: you can use and modify it freely, but any software you distribute that incorporates it must also be released under GPL.

In plain English

ASRT is a Chinese speech recognition system built with deep learning techniques using Python and TensorFlow. It listens to a short audio clip of spoken Mandarin Chinese and converts it to written text. The project is research-focused and intended for people who want to study or build on Chinese automatic speech recognition, not for plug-in commercial use.

The system works in two stages. The first stage is an acoustic model that takes an audio file and produces a sequence of Chinese phonetic symbols (called pinyin). This model uses deep convolutional neural networks combined with a technique called CTC, which handles the fact that audio and text do not line up character by character. The second stage is a language model that takes the phonetic sequence and converts it to actual Chinese characters, using a statistical probability approach.

To train the system from scratch you need a reasonably powerful machine: at least a 4-core CPU, 16 GB of RAM, and an NVIDIA GPU with 11 GB or more of graphics memory. Training uses publicly available Chinese speech datasets, and the project lists six datasets totaling over a thousand hours of audio. The best-performing version of the acoustic model achieves around 85% accuracy on phonetic recognition when tested against held-out data.

Once trained, the system can be run as an API server over HTTP or gRPC, so other software can send audio data and receive transcribed text. The project provides separate client SDKs for Windows, Python, Go, and Java to make calling the server straightforward. If you do not want to train anything yourself, pre-trained model files are included in the downloadable release packages. A Docker image is available for running the API server without manual setup, though training still requires a suitable GPU environment.

The project is licensed under GPL v3.0.
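The two-stage pipeline described above can be illustrated with a toy sketch. This is not ASRT's code: the blank symbol, the tone-numbered pinyin format, the candidate table, and every probability below are invented for illustration. Stage 1 shows the standard greedy CTC collapse rule (merge consecutive repeats, then drop blanks), and stage 2 picks the most probable character sequence with a small Viterbi search over hypothetical bigram probabilities.

```python
import math

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_greedy_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


# Toy tables, invented for illustration only.
CANDIDATES = {"ni3": ["你", "拟"], "hao3": ["好", "郝"]}
UNIGRAM = {"你": 0.6, "拟": 0.1, "好": 0.5, "郝": 0.05}
BIGRAM = {("你", "好"): 0.8, ("你", "郝"): 0.01,
          ("拟", "好"): 0.05, ("拟", "郝"): 0.01}
FLOOR = 1e-6  # fallback probability for pairs missing from the tables


def pinyin_to_text(pinyin_seq):
    """Viterbi search: most probable character string for a pinyin sequence."""
    if not pinyin_seq:
        return ""
    # best[ch] = (log-probability of the best path ending in ch, that path)
    best = {ch: (math.log(UNIGRAM.get(ch, FLOOR)), ch)
            for ch in CANDIDATES[pinyin_seq[0]]}
    for syllable in pinyin_seq[1:]:
        nxt = {}
        for ch in CANDIDATES[syllable]:
            score, path = max(
                (prev_score + math.log(BIGRAM.get((prev_ch, ch), FLOOR)), prev_path)
                for prev_ch, (prev_score, prev_path) in best.items()
            )
            nxt[ch] = (score, path + ch)
        best = nxt
    return max(best.values())[1]


# Per-frame labels as an acoustic model might emit them (made-up example).
frames = ["_", "ni3", "ni3", "_", "hao3", "hao3", "_"]
pinyin = ctc_greedy_collapse(frames)
text = pinyin_to_text(pinyin)
```

The blank symbol is what lets CTC keep a genuinely repeated syllable: `["ma1", "_", "ma1"]` collapses to two `ma1` labels, while `["ma1", "ma1"]` collapses to one. A real system would use learned probabilities and a far larger candidate dictionary, and may use a different decoding strategy.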

Copy-paste prompts

Prompt 1

I want to run ASRT Speech Recognition as an API server using Docker without training anything. Give me the exact Docker command to start it and a Python example that sends a WAV file and prints the transcription.

Prompt 2

Walk me through the two stages of the ASRT pipeline: what does the acoustic model output, and how does the language model turn that into Chinese characters?

Prompt 3

I want to train the ASRT acoustic model on one of the listed Chinese speech datasets. What are the hardware requirements, which dataset should I start with, and what command kicks off training?

Prompt 4

Show me how to call the ASRT gRPC API from Python: import, set up the channel, send an audio file, and print the returned text.


Verify against the repo before relying on details.